CHANCE vs. SELECTED DISTRACTORS IN A
VOCABULARY TEST


JOHN M. AND RUTH C. STALNAKER¹
University of Chicago 


I. THE PROBLEM


The purpose of this study was to determine the relative effective- 
ness or drawing power of chance and subjectively selected distractors 
in a typical five-choice best-answer (multiple-choice) vocabulary test. 


II. CONSTRUCTION OF THE TEST 


The vocabulary test consisted of seventy-five items. Each item 
contained a stimulus word and five response words; one of the response 
words was the correct or synonymous word, two were chance dis- 
tractors, and two were selected distractors. Seventy-one of the 
stimulus words were selected from the Teacher’s Word Book;² the four
remaining stimulus words were of a frequency-range too small to be 
listed in the Word Book. As far as possible, words were chosen for 
the correct responses which had a Thorndike index number within four 
of that of the stimulus word; for example, when the stimulus word 
had an index of ten, the index of correct response word was between 
six and fourteen.³ In forty-five items, the correct response had a





1 Mr. Lorenz Meyer discussed the problem with the authors, selected half of the 
stimulus and correct response words, and most of the chance distractors. He 
also checked over the completed test. 

2 Thorndike, E. L.: A Teacher’s Word Book of the 20,000 Words Found Most
Frequently and Widely in General Reading for Children and Young People. New 
York, 1931. 

3 This rule was not strictly applicable to words with an index of twenty,
because words of indices twenty-one to twenty-four are not given by Thorndike. 
In the cases of three stimulus words of index twenty, the correct response word was 
not listed by Thorndike; in four cases, the stimulus word was not listed. In four 
lower index number than the stimulus word, in eight items it had the
same index number, and in fifteen items it had a higher index number. 

The chance distractors were chosen by opening the Word Book at 
random and selecting the first word of the same part of speech and same 
index number as the correct response word. Where the correct 
response was not listed in the Word Book, the chance distractors were 
of an index of twenty. 

The selected distractors were chosen without reference to the 
Thorndike index number, and the index number sometimes differed 
from that of the correct response word by as much as fifteen. No 
strict routine procedure was followed in the choice of the selected 
distractors. No effort was made to choose always an antonym, a 
homonym, or a word with similar initial or final syllable. Each 
stimulus word was considered individually, and the selected distractors 
chosen because they were assumed to have a high drawing power quite 
regardless of any possible classification they might fall into. 

Table I gives the frequency of the stimulus words, of the correct 
response words, and of the chance and selected distractors, classified 
according to the Thorndike index number. Although no attention 
was paid to the index numbers of the selected distractors, in general 
they are well distributed over the range of index numbers. 

The 150 selected distractors (two for each item) were roughly 
classified, after they had been selected, as follows: forty-three syno- 
nyms of homonyms, of near-homonyms, or of words derived from the 
same root; forty-one words differing slightly in meaning from the 
stimulus word; thirty-two antonyms; twenty-one words commonly 
associated with the stimulus word; seven homonyms or near-homonyms 
of synonyms; six words similar to the stimulus word in appearance. 

The stimulus words were of the following parts of speech: thirty- 
three nouns, twenty-nine adjectives, thirteen verbs. In all except 
four cases, the response words were of the same part of speech as the 
stimulus words. 

The items were arranged in the order of the index numbers of the 
stimulus words. The response words were so arranged that the correct 





additional cases where the stimulus word had an index of twenty, the synonymous 
word had an index below sixteen. Of the sixty-eight items in which the index 
numbers of both stimulus and response words were listed, the following differences 
between them existed: no difference, eight items; a difference of one, thirteen 
items; of two, fourteen items; of three, nineteen items; of four, ten items; of five, 
three items; and of seven, one item. 
response occurred as frequently in position one as it did in position
two, three, four, or five. Likewise, the selected and chance distractors 
were given one position within an item as frequently as another. 
After the results had been completed, a study was made of the influence 


of position on the frequency of checking; no appreciable effect was 
found. 


TABLE I. FREQUENCY OF STIMULUS AND RESPONSE WORDS CLASSIFIED ACCORDING
TO THORNDIKE INDEX NUMBER

                                    Frequency of
Thorndike       Stimulus   Correct     Selected      Chance
index number    words      responses   distractors   distractors

  1              -           4          11             8
  2              -           2           8             4
  3              -           -           8             -
  4              3           4          13             8
  5              4           2          18             4
  6              5           6          10            12
  7              6           7          10            14
  8              6           -           8             -
  9              5           3           9             6
 10              5           2           6             4
 11              3           7           2            14
 12              3           3           6             6
 13              4           5           2            10
 14              4           2           4             4
 15              2          10           5            20
 16              2           5           5            10
 17              2           3           1             6
 18              2           4           6             8
 19              1           1           -             2
 20             14           1           3            10
Not listed       4           4          15             -

Total           75          75         150           150

















III. GENERAL RESULTS 


The test was incorporated as one section of an English placement 
test and given to six hundred thirty-seven entering freshmen at the 
University of Chicago. Although the directions included the admonition,
“Be sure to answer every item,” on the average 3.5 items were
omitted per paper. 
In order to secure some indication of the typicalness of the test, a
correlation was made between the score on it and the score on the 
vocabulary section of the Minnesota Reading Test for College Students, 
which was also given to all students. The Pearson correlation coefficient,
based on a sample of two hundred students, was .88. This
coefficient is high enough to suggest that this experimental test is a 
typical vocabulary test as far as results are concerned. 

The reliability of the test was computed by scoring the odd and 
even numbered items separately, correlating the scores so received, and 
estimating by the Spearman-Brown prophecy formula the reliability 
for the complete test. This was done on two samples of one hundred 
cases each. The two samples were then combined. The reliability 
was in each of the three cases .91. 
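
The odd-even procedure just described may be sketched as follows. This is a minimal illustration and not the authors' own computation: the function names and the small `demo` response matrix are hypothetical, and the Spearman-Brown formula is assumed in its usual form, r_full = 2r / (1 + r).

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half, factor=2):
    """Predicted reliability of a test `factor` times as long as the half."""
    return factor * r_half / (1 + (factor - 1) * r_half)


def split_half_reliability(responses):
    """responses: one list of 0/1 item scores per student."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Hypothetical data: four students, six items (1 = correct, 0 = wrong).
demo = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1],
]
print(round(split_half_reliability(demo), 2))
print(round(spearman_brown(0.835), 2))
```

By the same formula, a full-test reliability of .91 corresponds to an odd-even correlation of roughly .835.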

The score on the test was the number of correct responses multiplied 
by a factor of .4. This procedure made it possible to keep the squares 
of the scores within three figures and to punch the responses for each 
of the seventy-five items, the score, and the square of the score on the 
eighty-column card of the Hollerith machine. The scores ranged 
from two to twenty-seven. A perfect score was thirty. The mean 
was 17.11; the median, 17.93; standard deviation, 4.63; quartile 


deviation, 3.55. The distribution of scores was slightly negatively
skewed: $Sk = 3(M - Md)/\sigma = -.53$.
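
The skewness value reported above agrees with Pearson's second coefficient of skewness, 3(M - Md)/σ, computed from the stated mean, median, and standard deviation. The sketch below merely checks that arithmetic; the function name is hypothetical, and the choice of formula is an inference from the reported figures rather than something the text states.

```python
def pearson_skewness(mean, median, sd):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd


# Figures reported for the 637 papers (scores already multiplied by .4).
sk = pearson_skewness(mean=17.11, median=17.93, sd=4.63)
print(round(sk, 2))  # -0.53
```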





IV. DISTRIBUTION OF ITEM DIFFICULTY 


The difficulty of the items was given by the number of correct 
responses. For convenience, the difficulty was expressed as the 
percentage of the six hundred thirty-seven students who answered 
the item correctly; it was assumed that omitting an item was evidence 
that the student did not know it. The average item difficulty was 
57.7; that is, on the average each item was answered by 57.7 per cent 
of the six hundred thirty-seven students. The standard deviation of 
the distribution of item difficulty was 25.9 per cent; the range was 
from seven per cent to ninety-nine per cent. 

The Thorndike index number is not of course a satisfactory index of 
item difficulty—partially because the difficulty is dependent primarily 
upon the nature of the distractors, and partially because the Thorndike 
index is an index of frequency and range rather than difficulty. The 
correlation between the difficulty (percentage of correct responses) 
and the Thorndike index was -.53 ± .08 (standard error).
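
The standard error attached to this correlation is consistent with the classical large-sample formula σ_r = (1 - r²)/√N applied to the seventy-five items. The sketch assumes that formula (the text does not say which was used), and `se_of_r` is an illustrative name only.

```python
from math import sqrt


def se_of_r(r, n):
    """Classical large-sample standard error of a correlation coefficient."""
    return (1 - r ** 2) / sqrt(n)


# Difficulty vs. Thorndike index, one pair of values per test item.
se = se_of_r(r=-0.53, n=75)
print(round(se, 2))  # 0.08
```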
V. FREQUENCY OF CHECKING THE TWO TYPES OF DISTRACTORS


For purposes of comparison, the test items were divided into four 
groups on the basis of difficulty. Group one, the easiest, contained 
twenty-six items, answered correctly by seventy-five per cent or more 
of the students; group two contained nineteen items of difficulty 
fifty to seventy-five; group three, twenty items of difficulty twenty-five 
to fifty; group four, ten items answered correctly by less than twenty- 
five per cent of the students. Table II gives the frequency with which 
chance and selected distractors were checked at each difficulty level. 
A significantly greater proportion of students checked the selected 
distractors at each level. Unquestionably the selected distractors in 


this examination had a much greater drawing power than did the 
chance distractors. 


TABLE II. THE MEAN FREQUENCY WITH WHICH EACH TYPE OF RESPONSE WORD
WAS CHECKED WHEN THE ITEMS ARE GROUPED ACCORDING TO THE
DIFFICULTY LEVEL INDICATED

                            Average number who         Difference:                Average number who
Items answered   Number     checked each               selected -    Diff. /
correctly by     of items   Chance        Selected     chance        σ diff.     Checked          Omitted
                            distractor    distractor   distractors               correct answer

 0-24 per cent     10       34.9±6.1¹     203.2±36.8   168.3±37.3    4.5²        100.2±12.3       60.7±10.9
25-49 per cent     20       45.7±5.1      123.0±11.8    77.3±12.8    6.0         243.2± 9.1       56.5± 6.5
50-74 per cent     19       28.4±6.4       86.9±11.7    57.5±13.3    4.3         390.0±10.7       18.3± 4.0
75-99 per cent     26       10.6±2.8       30.4± 4.4    19.9± 4.9    4.1         549.1± 9.0        6.0± 9.2
Total              75       27.7±2.7       92.2± 8.2    64.5± 8.6    7.5         367.4±19.0       29.9± 3.7


























1 All errors are standard errors, not probable errors.

2 A ratio of 4.1 indicates that one would expect from the same infinite population only two
samples out of one hundred thousand to show differences as large as or larger than this one. Larger
ratios than 4.1 are even less probable. The conclusion, therefore, is reasonably clear that these
samples are probably drawn from different populations; that is, that a significant difference between
them exists.
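
The ratio of a difference to its standard error, as used in Table II, can be illustrated with the figures for the hardest group of items. The sketch assumes that the standard error of the difference was obtained by combining the two standard errors as though independent (this reproduces the tabled 37.3) and that the probability in the footnote is a one-tailed normal-curve area; both are assumptions, and the function names are hypothetical.

```python
from math import erfc, sqrt


def critical_ratio(mean_a, se_a, mean_b, se_b):
    """Return (difference, SE of difference, ratio), treating the two
    means as independent so that the squared standard errors add."""
    diff = mean_a - mean_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)
    return diff, se_diff, diff / se_diff


def upper_tail(z):
    """Area of the unit normal curve lying above z."""
    return 0.5 * erfc(z / sqrt(2))


# Items answered correctly by 0-24 per cent: selected vs. chance distractors.
diff, se_diff, ratio = critical_ratio(203.2, 36.8, 34.9, 6.1)
print(round(diff, 1), round(se_diff, 1), round(ratio, 1))  # 168.3 37.3 4.5

# Chance of a ratio of 4.1 or more: about two in one hundred thousand.
print(upper_tail(4.1))
```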


To check this conclusion further, the number of persons who 
checked each distractor was expressed as a percentage of all persons 
who checked any one of the distractors of the item in question. In 
some cases, the base for the computation of the percentage was six 
hundred, and in other cases only two or three. The extreme unrelia- 
bility of the percentages based on the small number of cases throws 
this method into severe question. On the average, 12.6 per cent of the
students who marked any distractor on a given item marked each 
chance distractor, and 37.4 per cent marked each selected distractor. 
The difference of these two percentages, 24.8 per cent, is probably a
significant difference. In other words, considering only those persons
who marked distractors, and using the item as the basis, each selected
distractor drew about twenty-five percentage points more of these
students than did the corresponding chance distractor.

On the average, a test paper¹ had 43.3 items correctly marked and
3.5 items omitted. Of the 28.2 items in which a distractor was
marked, 6.5 (σ_M = 0.32) were chance distractors and 21.7 (σ_M = 0.53)
were selected distractors. The ratio of the difference of the means,
15.2, to its standard error, 0.61, is approximately twenty-five. On
the average, each paper had a significantly greater number of selected
than of chance distractors marked.

1 These results are based on a sample of two hundred papers.


VI. NATURE OF STUDENTS WHO CHECKED EACH TYPE OF DISTRACTOR


Greater drawing power does not in itself signify a good distractor.
An item with a distractor which draws equally high or higher scoring
students than does the correct response is a bad test item; it does not
discriminate. In the test here described, the score of a student
on the total test was taken as the criterion score. The items were again
divided into the four difficulty levels, and the mean score was computed
for all persons who marked the correct response, a chance distractor, or
a selected distractor. Table III summarizes the data. In each case
the mean score of those students who marked the correct response is
significantly greater than the mean score of those who marked a
selected distractor. The use of selected distractors does not prevent
the items from discriminating.


TABLE III. MEAN SCORE OF STUDENTS WHO MARKED THE RESPONSE INDICATED,
CLASSIFIED ACCORDING TO FOUR LEVELS OF ITEM DIFFICULTY

                                     Items of difficulties indicated
Number of items                      26            19            20            10
Answered correctly by                75-99         50-74         25-49         0-24
                                     per cent      per cent      per cent      per cent

Correct response                     17.69±.034    18.68±.05     19.39±.06     18.72±.15
Selected distractor                  14.02±.11     15.03±.08     16.25±.06     17.30±.06
Chance distractor                    12.62±.18     14.37±.11     15.77±.09     16.53±.17
Omitted                               8.47±.35     10.76±.23     12.90±.13     13.58±.19

Difference of means of those who
marked correct response and those
who marked selected distractors,
divided by standard error of
difference                           33            33            41            9

All errors are standard errors.

It is probably true that the use of selected distractors makes the
item more difficult. Thus the average score of those who marked
selected distractors is significantly higher on each difficulty level than
the average score of those who marked chance distractors.


VII. EQUALITY OF DISTRACTORS WITHIN AN ITEM 


Evidence has been offered to show that the chance element in a 
best-answer item is reduced when the distractors are marked with 
equal frequency.¹ It is pertinent, therefore, to find the relationship
in this test between the frequencies with which each of the two chance 
(and also the two selected) distractors of an item were checked. A 
correlation was made between the number of students who checked the 
first chance distractor of a given item, and of those who checked 
the second chance distractor. A similar correlation was made for the 
selected distractors. The two chance distractors correlated .16 with a 
standard error of .11; the correlation is not significantly different from 
zero. Likewise, the two selected distractors gave a correlation of 
-.17, with a standard error of .11; again, the correlation is not
significantly different from zero. Neither chance nor selected dis- 
tractors meet this criterion of equal frequencies of checking. A test 
to meet this criterion can be built only after careful item analyses have 
been made, and then doubtless several revisions are necessary. In 
this test, forty-nine per cent of the one hundred fifty chance dis- 
tractors were marked by less than two per cent (twelve) of the students. 
Sixteen per cent of the one hundred fifty selected distractors were 
marked by less than two per cent of the students. Twelve of the 
chance distractors were checked by no one, and fourteen were marked 
by one student only. In contrast, only two of the selected distractors 
were marked by no one, and only one of them was marked by but one 
student. Of the seventy-five items, forty-eight (sixty-four per cent) 





1 Horst, Paul: “The difficulty of a multiple-choice test item.” Journal of
Educational Psychology, March, 1933, pp. 229-232.
had one or more distractors marked by less than two per cent of the
students. 


VIII. CONCLUSIONS


In only a few types of best-answer tests does one think of using 
“chance” distractors. In general, specific distractors are selected 
because one wishes to determine whether or not students can dis- 
criminate between the correct response and certain erroneous responses 
which one has reason to believe are common among the poorer students. 
In fact, the peculiar value of the test form may be that it measures a 
certain discriminatory ability. As the vocabulary test offers an 
opportunity for the use of “chance” material, to which some scientific
virtue has been attributed, it makes possible investigation of the relative
values of the two types of distractors.

Under the conditions of this experiment, selected distractors were 
found to be marked to a significantly greater extent than were chance 
distractors. This finding was true for items of each difficulty level 
measured. The selected distractors did not prevent an item from 
discriminating between the good and the bad students. Many 
more of the chance distractors were marked by no one or by a very 
small number of persons than was true for the selected distractors. 
As selected distractors are approximately as easy to find as are chance 
distractors, for persons who have had some experience in test construc- 
tion, there is good reason why selected distractors should be used. 





MEASURING CHILDREN’S ATTITUDES TOWARD
THEIR PARENTS¹


ROSS STAGNER
People’s Junior College, Chicago
AND


NEAL DROUGHT 
University of Wisconsin 


The possibility of applying the ingenious techniques of Dr. Thur- 
stone (1) to the measurement of more intimate attitudes than those 
relating to impersonal topics such as prohibition suggests a quanti- 
tative approach to a problem of great psychological importance: The 
attitude taken by the child toward his parents. The significance of 
this attitude has been recognized for a long time by workers in the 
fields of juvenile delinquency, child problems, etc. Research, how- 
ever, has been hampered by the necessity of using very crude and 
unsatisfactory measures of attitude. 

In the course of a projected study of the relationships between 
parent and child as they influence personality, the senior author found 
it necessary to develop some technique for quantifying the subject’s 
attitude of affection or antagonism toward his parent. The scales 
reported here are the means developed.²

The Thurstone technique is so well-known that repetition would 
be valueless. We shall only mention a few essential details for workers 
interested in the construction of the scales. One hundred twenty 
statements expressing various degrees of affection or antagonism were 
used. The attitude continuum was defined for the judges as the usual 
eleven-point scale ranging from “greatest possible affection” through
neutral to “greatest possible aversion.”

The statements were sorted by fifty judges, equally divided as to 
sex. (A different group sorted the statements for each scale.) This 
number of judges is considerably less than that advocated by Thur- 
stone; however, practical considerations made it seem necessary to 





1 This investigation was made possible by a fellowship grant for 1932-1933 
from the Social Science Research Council. 

2 The scales were planned and statements gathered by Dr. Stagner. Assistance
with this work was given by Dr. Thurstone and Dr. Ruth Peterson of the Uni- 
versity of Chicago. The judgments were gathered and the final scales con- 
structed by Mr. Drought. 
finish the scales without the delays involved in obtaining more judges,
and the highly satisfactory character of the results leads us to question 
the need for more than fifty. 

In the preparation of the first scale, the statements were phrased 
as applying to the subject’s father. As soon as it was completed, the 
statements were placed in the feminine gender and the mother scale 
prepared. Since the ideational content of the statements remained 
identical, we expected similar values for the two scales. This was not 
the case. It appears that a statement made about one’s father is 
more favorable than the same statement made about one’s mother. 

The difference is not statistically reliable, the mean difference 
being 0.25 with a sigma of 0.37, but eight statements were rated a full 
scale interval more unfavorable on the mother scale (none deviated 
that far in the opposite direction). These eight statements reveal
some differences in the stereotypes “mother” and “father.” For
instance, two statements were of the type, “we’re just good friends,”
which apparently is a good thing to say about your father but not of
your mother. Two indicated a distant or reserved attitude, which is
rated indifferent for father but rather unfavorable for mother. “I
feel no affection for him; in fact, he frequently annoys me” was rated
slightly unfavorable for father, quite so for mother.

The differences in our stereotyped habits of thinking and feeling
about our two parents come to the fore quite clearly when we consider
these statements. The father is conceived as a stern, strong,
silent being who is perfectly within our expectations in being reserved 
and distant; indeed, when he unbends, we consider it a mark of rare 
good fortune. The mother, on the other hand, must be adored, and 
the type of affection one gives to a friend is below what we owe to her. 
Her faults must not be admitted to anyone. (This mother idealization
is characteristic of an American civilization in which “mother-appeal”
is used to sell everything from life-insurance to candy!)


SEX DIFFERENCES IN JUDGES’ RATINGS 


Table I shows the mean ratings given the one hundred twenty 
statements on each scale by male and female judges. The most 
interesting fact revealed here is the highly significant difference
between father and mother scales for female judges. This may be 
interpreted as showing that girls really favor their fathers more, or 
it may be said to indicate that the girls, disliking their own fathers, 
interpret the statements as relatively more favorable than they otherwise
might. The men stand in remarkable contrast to the women in
this respect, their mean scores differing by only 0.02 of a scale interval 
for the two parents. 


TABLE I. MEAN RATINGS OF ONE HUNDRED TWENTY STATEMENTS BY JUDGES

                     Men       Women
Father scale         5.61      5.44
Mother scale         5.63      5.80
Critical ratios
TABLE II. THE SPLIT-HALF RELIABILITIES, THE SPEARMAN-BROWN PREDICTIONS
FOR EACH SCALE, AND THE CORRELATIONS WITH MEASURES OF SELF-RATING

                                 F scale    M scale
Split-half reliability           .76        .72
Spearman-Brown prediction        .86        .83
Self-ratings:
  Authority                      .66        .55
  Affection                      .64        .64
  Confidence                     .63        .56
  Average                        .74        .79











RELIABILITY AND VALIDITY OF THE SCALES 


Table II shows the reliability and validity coefficients obtained 
for the final scales. Before discussing the table, we may briefly 
describe the scales themselves. 

Scale 1 (father) included forty statements ranging in scale value 
from 0.2-10.5. It was divided in half, each half including the same 
number of statements from a given portion of the attitude continuum. 
Unfortunately, we did not choose the same number from every portion 
of the continuum. (E.g., four were chosen from the interval 2.1-3.0,
but only two from the interval 9.1-10.0.) This error was corrected 
in preparing the mother scale and a revised form of the father scale. 

Scale 2 (mother) included thirty-two statements having approxi- 
mately the same range. It also was prepared so that the two halves 
were psychologically equivalent. Our method of determining the 
reliability of the scales involved computing the correlation between
the two halves and predicting by the Spearman-Brown formula the 
reliability of the whole scale. 

The results given below are based on slightly over one hundred 
college students of each sex. We have not yet attempted to apply 
the scale to younger subjects. 

The reliabilities shown in this table are fairly high. They are 
attenuated by several factors which could not be taken into account; 
for instance, the narrow range within which most of the subjects fall. 
Almost eighty per cent of the subjects fall within the range 2.1-4.0. 

In addition to the method described below, validity was estimated 
by correlations between each of the scales with graphic self ratings 
on authority, affection and confidence. A number of subjects were 
asked to rate themselves on a linear scale for their attitudes with 
reference to (1) submission-rebellion regarding authority of parent; 
(2) affection-aversion; and (3) confiding versus not confiding in the 
parent. The correlations with each scale and with the average of 
the ratings show that these opinion scales are measuring something 
of which the subject himself is also aware. Considering the probable 
low reliability of the graphic ratings, these coefficients are very high. 

The validity of the scales was also checked by autobiographical 
material. Biographies were collected in connection with the work on 
personality previously mentioned. In these it was noted that the 
subjects revealed their attitudes quite as clearly as on the opinion 
scales, although of course not in convenient quantitative fashion. 
We have, however, used a more objective method than trying to 
compare the biographical statements with the scale scores. 

Each of about fifty students answered numerous questions about 
both parents in the course of the biography. Answers to these ques- 
tions were dichotomized as positive or negative. (E.g., did your 
father take a close personal interest in you? Some answers were a 
clear positive and others a clear negative. If the experimenter was 
uncertain, the subject was omitted in that particular comparison.) 
Mean attitude scores were then computed for those giving positive
and negative answers. In the instance cited, mean F score was 2.81
for yes, 4.07 for no, in male subjects. For female subjects the figures
were 2.51 and 3.38. In these scales a low score means a favorable
attitude.

Using this method, we have computed the effect of different
parental practices upon the child’s attitude. The results are nearly
always what would be expected, and in the deviating cases, explanations
can easily be seen. In the instance cited above, the group showing
a negative father attitude showed a positive mother attitude.
This we should very promptly explain as a compensatory reaction.


TABLE III. BIOGRAPHICAL DATA AS RELATED TO FATHER ATTITUDE SCORES

Question                                         Answer      Men     Women

Father take a close personal interest in you?    Yes         2.81    2.51
                                                 No, etc.    4.07    3.38
Did he demand obedience?                         Yes         4.06    2.56
                                                 No          3.16    2.57
[illegible]                                      Yes         4.42    3.14
                                                 No, etc.    3.22    2.67
Spend much time playing with you?                Yes         2.78    2.68
                                                 No          4.07    3.31
Did you idealize him?                            Yes         2.68    2.46
                                                 No          4.20    3.58
[illegible]                                      Yes         4.83    3.12
                                                 No          2.82    2.84


TABLE IV. BIOGRAPHICAL DATA AS RELATED TO MOTHER ATTITUDE SCORES

Question                                         Answer      Men     Women

Was your mother happy?                           Yes         2.50    2.25
                                                 No          4.20    4.55
Did she demand and enforce obedience?            Yes         2.90    2.34
                                                 No, tried   4.37    4.13
[illegible]                                      Yes         4.59    5.86
                                                 No          2.44    2.71
[illegible]                                      Yes         2.74    2.16
                                                 No          4.93    4.77
[illegible]                                      Yes         2.58    2.20
                                                 No          4.97    5.10
Have any conflicts with her?                     Yes         4.54    4.53
                                                 No          2.65    3.20


Table III gives items from these biographies which resulted in
differences of over one scale interval (1.00) on the father scale for
either men or women. Reversal of direction of the difference between
men and women subjects was very rare and always of small extent.
Table IV shows similar items for the mother scale. Standard errors
of these differences range from 0.6 to 0.8.

These tables need very little comment. The results uniformly
indicate that the scales are valid. Only one
possible factor, which we are inclined to discount, might enter in: 
this is the chance that students having an antagonistic attitude might 
report facts as colored by their prejudices. Personal interviews with 
the subjects and other methods of checking their veracity made it seem 
unlikely that this factor was of great importance. 


SEX DIFFERENCES IN ATTITUDE SCORES


The emphasis of the Freudian theory upon sex differences in 
feelings and attitudes toward parents led us to look for further differ- 
ences beyond those shown in the results of the judges. Table V 
shows that no reliable differences occur when the final scales are used. 
Since the two sexes were represented equally in determining the scale 
values of the statements, we may have adequately controlled this 
factor. 


TABLE V. DISTRIBUTION OF ATTITUDE SCORES FOR MEN AND WOMEN

                  Father scale          Mother scale
Score             Men      Women        Men      Women

0.1-1.0             0        0            0        0
1.1-2.0             8       15            6        8
2.1-3.0            55       42           44       63
3.1-4.0            34       24           14        9
4.1-5.0             7       12           14       11
5.1-6.0             4        5            4        8
6.1-7.0             3        4           10       10
7.1-8.0             3        2            1        2
8.1-9.0             0        2            1        0
[illegible]       2.9      2.7          2.9      2.6
N                 114      106           94      111











INTERRELATION OF THE ATTITUDE SCALES 


It is probably significant to know, not only the scores made by 
subjects on the separate scales, but the degree and direction of con- 
comitant variation on the attitude continuum. 
The correlation of the two scales, using forty-nine cases, is plus
.168. While this figure indicates that no significant relationship 
between the two attitudes measured can be asserted, it is important 
to note that the relationship is positive rather than negative, and is 
about twice its probable error. 
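
The statement that the coefficient is about twice its probable error can be checked under the conventional definition, P.E. = 0.6745 (1 - r²)/√N. That definition is an assumption here (the text does not give the formula), and the function name is illustrative.

```python
from math import sqrt


def probable_error_of_r(r, n):
    """Conventional probable error of a correlation: 0.6745 times the
    large-sample standard error (1 - r**2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


r, n = 0.168, 49  # correlation of father and mother scales, forty-nine cases
pe = probable_error_of_r(r, n)
print(round(pe, 3))      # 0.094
print(round(r / pe, 1))  # 1.8
```

A ratio of about 1.8 falls well short of the multiples of the probable error customarily required for significance, which agrees with the authors' caution.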

The significance of the positive relationship lies in the fact that, 
under the usual interpretation of the Freudian theory, we should 
expect a negative correlation of the two scales. Father antagonism 
should be accompanied by mother fixation, and vice versa.¹ The
truth of the matter is that the child is likely to take a similar attitude 
toward both parents. 

This also bears upon an important question concerning the primary 
cause of the attitude. If the behavior of the parents is the primary 
cause of the reported attitude, there is no particular reason for the 
positive correlation. But if we assume that the personality of the 
child enters as a common factor in determining the degree of indif- 
ference or antagonism to each parent, then we can easily understand 
the positive relationship. A tabulation of the scores of these subjects 
on the Bernreuter personality inventory indicates that those making 
high (antagonistic) scores toward each parent score high on self-sufficiency.²
This fact, taken in connection with the positive correlation
of the two attitude scales leads us to infer that self-sufficient children 
are likely to take a less favorable attitude toward both parents than 
children lacking in this characteristic, and as a corollary, that one of 
the important factors determining the attitude of the child toward his 
parent is his own personality.


SUMMARY 


It has been shown that the Thurstone technique is applicable to 
the measurement of very intimate attitudes, in this case the attitude 
of a child toward his parent. This measurement, while necessarily 
subject to many criticisms, is consistent with itself, as shown by 
reliability coefficients, with biographical data, and with self-ratings 
on various specific attitudes toward the parent. 





1 If the process of identification is taken into account, this interpretation is not 
necessary. 
2 These data will be reported in detail in subsequent articles. 

The data contradict a common interpretation of the Oedipus com- 
plex theory, but this contradiction appears to be more superficial than 
real. 

There is some indication that the child’s attitude toward his parent 
is determined, not only by parental treatment, but also by the per- 
sonality of the child. 


THE EFFECT OF KNOWLEDGE OF RESULTS ON MAZE 
LEARNING AND RETENTION 


LELAND W. CRAFTS AND RALPH W. GILBERT 
New York University 


The aim of the present experiment was to determine the effect of 
knowledge of results upon the learning and the retention of a stylus 
maze. This knowledge consisted (1) of informing the subject before 
he began his practice as to the average score in trials, errors and time 
made by a control group in learning the maze, and (2) of telling him, 
after every trial of the learning, the total number of errors he had 
committed and of trials and minutes he had required thus far. Hence 
for the subject, comparison with the scores of other individuals, and 
presumably some measure of competition with them, as well as 
knowledge of his own progress, were involved. 

Generally speaking, the experimental literature shows that such 
incentives have a favorable influence upon learning. With reference 
to knowledge of results per se, i.e. to informing the subject of his own 
individual scores but of those only, Arps? and Crawley’ found that such 
knowledge improved work with an ergograph; Book and Norvell? 
found it beneficial for college students doing cancellation, substitution 
and multiplication; and both Brown‘ and Leuba,” using school 
children as subjects, found it advantageous for arithmetical work. On 
the other hand Ross‘ has recently shown that information as to marks 
on weekly tests has little if any consistent effect upon grades made in 
college classes; and certain minor experiments, such as those of 
Deputy® and of Colburn, Collins and Myers*® have yielded similar 
results. 

In addition to informing subjects of their own individual progress, 
incentives of a definitely competitive type have also been employed in 
numerous experiments. These incentives, it is important to note, may 
be classified as follows: (1) Knowledge of the scores made by others. 
(2) Instructions to compete with others. (3) Working with and in the 
actual physical presence of those subjects designated as rivals. (4) 
Rewards for superiority and punishments for inferiority within a 
group, these including the publication of the relative standing of its 
members, and the bestowal of special attention, praise, or more 
material tokens of merit upon those achieving superior rankings. The 
effect of the above incentives upon learning has been almost without 
exception favorable, as is evidenced by the studies of Whittemore” 
and Sims" with adults, and of Chapman and Feder,’ Symonds and 
Chase,'® Ross,'!® Leuba” and Maller! with children.* 

The present experiment differs from those of the “knowledge of 
results only” type in that the subjects were informed not only of their 
own score but also of the average made by others in learning the maze 
and were likewise urged to try to surpass the latter. It differs from all 
those of the “competitive” type listed above in that the subjects 
worked in the presence of the experimenter only, and were in no way 
rewarded or punished for their success or failure save as the knowledge 
thereof on the part of both themselves and the experimenter may have 
possessed such functions. In fact the only previous experiment in 
which the incentives used closely approximate our own is that of 
Panlasigui and Knight.'* In this investigation two equated groups of 
fourth grade children were given twenty drills in arithmetic distributed 
over a period of twenty weeks, those in one group keeping a record of 
their own scores and being informed also of the average score for the 
whole class, those in the other working without that record and that 
knowledge. Under these conditions the experimental group slightly 
surpassed the control, though this advantage was confined almost 
entirely to the upper quartile of the former. Whether, however, the 
use of similar incentives with college students learning a maze would 
aid in either learning or retention cannot of course be deduced from the 
results of an experiment so different in both materials and subjects. 

In fact no data exist from which any sound generalization can be 
drawn as to the significance for learning of an individual’s knowing 
both his own results and those of other subjects. Yet from a practical 
viewpoint, particularly in the field of education, it would be very much 
worth-while to accumulate facts relative to the value of just such 
incentives as these. Such incentives are simple, easily formulated and 
applied, and do not require elaborate, and often impracticable attempts 
to stage enthusiastic competitions among pupils to the accompaniment 
of exhortations, the pitting of matched individuals against each other, 
the awarding of prizes, and the like. It was with the intention, 
therefore, of contributing to existing knowledge at least to the extent 





* In this connection it might be pointed out that the effect merely of working 
with others instead of alone, which Mayer,'? Elkine® and Allport! have found to 
be on the whole advantageous to various work of an academic and laboratory test 
nature, is probably due largely to the atmosphere of competition which it com- 
monly induces. 
of finding out whether these incentives would be advantageous in the 
special case of maze learning by college students that this experiment 
was undertaken. 


PROCEDURE 


Subjects.—The subjects were fifty male students in introductory 
psychology at New York University, divided into two groups of 
twenty-five each. Those in the control group learned the maze 
first, and set the standard for the experimental group, who learned it 
some months later. It is the experimenters’ belief, however, that the 
two groups are wholly comparable with respect to age, college class, 
intelligence and scholarship. 

Apparatus.—The maze employed was McGeoch’s and Melton’s"® 
medium maze, cut from three-ply veneer, measuring 8.5 by 11 in., 
mounted on a bakelite base. Curtains concealed the maze from the 
subjects. 

Method.—The control group learned the maze under the conditions, 
and with the instructions, customary in stylus-maze experiments. 
The criterion of learning was to traverse the maze two of three times in 
succession without going so far into a cul-de-sac as actually to touch 
its farther end with the stylus (such an error is hereafter termed a 
“contact” error). Mere entrances into a blind alley (called 
“entrance” errors) and retracings were recorded, but were not regarded 
as errors from the viewpoint of satisfying the criterion. In scoring 
retracings every turning of a corner was counted as a separate error. 
After an interval of one week the maze was relearned according to the 
same criterion as before. Many of the subjects attempted rehearsals 
of the maze during this interval, but the amount of this recall, as 
answers to questions asked after the retention test showed, was no 
greater in one group than in the other. 

The experimental group worked under precisely the same conditions 
except that for the learning, but not for the relearning, the following 
information and instructions were given them on a typewritten sheet. 
Since these constitute the special incentives employed, they are 
reproduced verbatim. 


The average score made by men students at this university, working under 
exactly the same conditions as you are, is to learn this maze: 
(Note.—The twenty-four minutes is actual working time only, and does not 
include rest intervals between trials.) 
Try to do better than this average. 


In order that you may keep track of your progress, I will tell you after every 
trial your total trials, errors and time up to and including that trial. 


After the subject had read the above and the experimenter had 
made certain that he understood it, the sheet was placed face up on a 
chair and the subject was urged to glance at it during the learning 
whenever he wished to do so. Most subjects did refer frequently to it 
during that period. 

As the instructions state, the subject was told by the experimenter 
after every trial how many trials he had had thus far, how many errors 
he had made, and how many minutes he had spent actually traversing 
the maze. As soon, however, as a subject exceeded the average for 
any one of these criteria, the experimenter made no further reference 
to it and now gave the cumulative totals for the other criteria only. If 
a subject’s score exceeded the average according to all criteria, the 
learning was completed without any further information or comment 
from the experimenter. 

As soon as a subject in this group had finished the learning, he was 
asked various questions of an introspective type, which are given 
verbatim later, and his answers thereto were recorded. 

In both groups any subject was regarded as having failed if his total 
learning time exceeded sixty minutes, or if he worked for sixty minutes 
without making any significant progress, or if he was unable within 
thirty minutes to complete the first trial. The number of subjects 
excluded for these reasons was four in the control group and two in the 
experimental. 


RESULTS 


The results of the experiment are given in Tables I to V inclusive 
on pp. 181 and 182. 

In learning (Table I) the control group was slightly superior in 
trials and time and the experimental in errors. Since, however, the 
differences are all very unreliable and are likewise not even consistent 
in direction, the two groups may be regarded as showing no significant 
differences in learning. 

In performance on the two final trials of the learning (Table II) the 
groups also do not differ significantly. 

TABLE I.—THE MEANS, STANDARD DEVIATIONS AND CRITICAL RATIOS FOR 
DIFFERENCES BETWEEN THE MEANS FOR CONTROL AND EXPERIMENTAL 
GROUPS FOR LEARNING 

                              Control             Experimental
Criterion of learning      Mean    SD dist.     Mean    SD dist.    Diff./SD diff.
Trials                     19.2      6.0        21.7     11.6           1.4
Contact errors            185.3    103.8       164.5     79.1           0.8
Retracings                166.5    119.5       154.3    115.2           0.4
Time                     1438.4    749.1      1562.2    719.0           0.6

TABLE II.—THE MEANS FOR THE TWO FINAL TRIALS OF THE LEARNING FOR 
CONTROL AND EXPERIMENTAL GROUPS 

Criterion of performance     Control    Experimental
[illegible]                    0.36         0.84
[illegible]                    0.58         0.32
[illegible]                   28.6         29.0

TABLE III.—THE MEANS, STANDARD DEVIATIONS AND CRITICAL RATIOS BETWEEN 
THE MEANS FOR CONTROL AND EXPERIMENTAL GROUPS FOR RELEARNING 

                              Control             Experimental
Criterion of relearning    Mean    SD dist.     Mean    SD dist.    Diff./SD diff.
Trials                      5.9      4.6         6.1      4.9           0.1
Contact errors             16.8     14.6        14.6     12.0           0.6
Retracings                 11.7     16.3         9.0      9.7           0.7
Time                      214.3    166.4       187.8    114.8           0.7

In retention as measured by relearning (Table III) the experimental 
group is slightly superior by all criteria save trials. But although the 
differences are fairly consistent in direction, none is of sufficient 
magnitude to justify the conclusion that either group is reliably 
different from the other. 
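
The critical ratios in the last column of Table III can be approximated by dividing the difference between the two means by the standard error of that difference. The sketch below assumes the usual formula for two independent groups and assumes twenty-five subjects in each group, as stated under Procedure; neither assumption is confirmed by the text, and the function name is hypothetical.

```python
import math

def critical_ratio(mean_a, sd_a, mean_b, sd_b, n_a=25, n_b=25):
    """Absolute difference between two means over the SD of the difference.

    Assumes independent groups, so SD_diff = sqrt(sd_a^2/n_a + sd_b^2/n_b).
    The group sizes default to 25, an assumption taken from the Procedure.
    """
    sd_diff = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return abs(mean_a - mean_b) / sd_diff

# Table III: (control mean, control SD, experimental mean, experimental SD)
table_iii = {
    "Trials": (5.9, 4.6, 6.1, 4.9),
    "Contact errors": (16.8, 14.6, 14.6, 12.0),
    "Retracings": (11.7, 16.3, 9.0, 9.7),
    "Time": (214.3, 166.4, 187.8, 114.8),
}

for name, row in table_iii.items():
    print(f"{name}: {critical_ratio(*row):.1f}")
```

With these assumptions the recomputed ratios round to 0.1, 0.6, 0.7 and 0.7, the values printed in the table. All are far below the value of about three that was customarily required before a difference was called reliable.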

In retention as measured by recall, i.e. by performance on the first 
trial of the relearning (Table IV), the experimental group was slightly 
superior by all three criteria. The differences, however, are far too 
small to be significant. 

TABLE IV.—THE MEANS FOR RECALL, i.e. FOR PERFORMANCE ON THE FIRST TRIAL 
OF THE RELEARNING, FOR CONTROL AND EXPERIMENTAL GROUPS 

Criterion of performance     Control    Experimental
[illegible]                    5.6          4.2
[illegible]                    4.7          4.3
[illegible]                   54.1         53.5

TABLE V.—THE SAVING SCORES, COMPUTED FROM THE MEANS GIVEN IN TABLES 
I AND IV, FOR THE CONTROL AND EXPERIMENTAL GROUPS 

                              Control               Experimental
Criterion of               Absolute   Per cent    Absolute   Per cent
learning-relearning        saving     saving      saving     saving
Trials                       13.3      70.0         15.6      71.2
Contact errors              168.5      90.9        148.9      91.1
Retracings                  154.8      93.8        145.3      94.1
Time                       1224.1      84.4       1374.4      89.7

In retention as measured by saving (Table V) the experimental 
group is slightly superior in absolute saving in trials and time, and in 
per cent saving according to all criteria. Again, however, the differ- 
ences between the two groups are too small to be regarded as significant. 
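
A saving score is ordinarily the learning score minus the relearning score, expressed either absolutely or as a percentage of the learning score. The sketch below applies that definition to the control-group means, taking the learning means from Table I and the relearning means from Table III. That pairing is an inference from the arithmetic, not something the table caption states, and the function name is illustrative.

```python
def saving(learning_mean: float, relearning_mean: float):
    """Return (absolute saving, per cent saving) for one criterion."""
    absolute = learning_mean - relearning_mean
    percent = 100.0 * absolute / learning_mean
    return absolute, percent

# Control group: (learning mean from Table I, relearning mean from Table III)
control = {
    "Trials": (19.2, 5.9),
    "Contact errors": (185.3, 16.8),
    "Retracings": (166.5, 11.7),
    "Time": (1438.4, 214.3),
}

for name, (learn, relearn) in control.items():
    absolute, percent = saving(learn, relearn)
    print(f"{name}: absolute {absolute:.1f}, per cent {percent:.1f}")
```

The absolute savings computed this way agree with the control column of Table V (13.3, 168.5, 154.8, 1224.1). The per cent figures do not all agree to the first decimal with the printed ones, which may have been obtained on a different basis, so they are best read as approximate.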

In short, all the data, both for learning and for retention, disclose no 
difference between the control and the experimental groups that even 
approaches reliability. 


The questions asked the subjects after the completion of the 
learning and their answers thereto are as follows: 


1. Did you think of, keep in mind, throughout the learning the averages for the 
other subjects given you at the beginning of the experiment? 

Yes, nineteen subjects. 

Yes, qualified by “at first,” “later,” “sometimes,” six subjects. 

2. Did you give equal attention to all of the three standards of trials, errors 
and time, or did you give more attention to one than to the others? If the latter, 
to which did you give more attention? 

Equal attention, four. 

More attention to trials, five. 

More attention to errors, five. 

More attention to trials and errors, four. 

More attention to time, seven. 

3. Did you find this setting of a standard, and repeated information as to 
your own progress, emotionally disturbing? 

No, eight. 

No in general, but some emotion admitted, three. 

Yes, six. 

Yes, qualified by “at first,” “a little,” “somewhat,” “at times,” seven. 

Not answered, one. 


4. Do you think that you learned the maze more rapidly under these conditions 
than you would have had you been ignorant both of the standard and of your 
own results? 


No, thirteen. 

No, qualified by “probably,” one. 
Yes, nine. 

Yes, qualified by “probably,” two. 
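
As a consistency check on the tallies above, the short sketch below confirms that the answers to each question account for all twenty-five subjects of the experimental group, and counts how many answers to question 3 admit at least some emotion. The dictionary keys are shorthand labels introduced here, not the authors' wording.

```python
# Answer counts transcribed from the four questions above.
answers = {
    1: {"yes": 19, "yes, qualified": 6},
    2: {"equal": 4, "trials": 5, "errors": 5, "trials and errors": 4, "time": 7},
    3: {"no": 8, "no, some emotion admitted": 3, "yes": 6,
        "yes, qualified": 7, "not answered": 1},
    4: {"no": 13, "no, qualified": 1, "yes": 9, "yes, qualified": 2},
}

# Total number of respondents recorded for each question.
totals = {question: sum(counts.values()) for question, counts in answers.items()}

# Subjects admitting any emotion: everyone except flat "no" and "not answered".
some_emotion = sum(
    count for label, count in answers[3].items()
    if label not in ("no", "not answered")
)

print(totals)
print("admitted some emotion:", some_emotion)
```

Each question sums to twenty-five, and sixteen subjects admitted some emotion against eight who denied it.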


DISCUSSION 


To find no significant differences in learning or retention between 
subjects working with and without knowledge both of their own scores 
and those of others does not accord with expectations based on the 
results of previous experiments. The reasons, however, for this 
divergence between our results and those of most other investigations 
seem to be as follows: 

1. Ross,’* as we have already remarked, found that knowledge of 
scores made on college tests had no appreciable effect on the grades 
achieved on later examinations, though such differences as did exist 
favored the knowledge group. In his opinion one cause for the failure 
of that group to significantly surpass the control was that most college 
students have in any case a “moderately accurate subjective impression” 
as to the quality of their work, so that even the control subjects 
had some notion as to their achievement. In the present experiment 
similarly, while few if any subjects knew what really constitutes a good 
maze learning score, even the control group had clearly before them 
a definite goal—namely, to traverse the maze two out of three succes- 
sive times without error—and could see and appreciate their progress 
towards it. In many of the studies previously cited, however, the 
control group was without any knowledge of results whatsoever and 
worked blindly upon cancellation, substitution, arithmetic problems, 
etc. with no idea either of their own scores or of any goal towards 
which any perceptible advance could be made. 

2. Ross also expresses his belief that college work is in general so 
strongly motivated that the substitution of an exact and objective for a 
vague and subjective knowledge of results is an incentive too slight for 
its introduction to effect any appreciable increment to score. This 
conclusion also seems applicable to our subjects. For the latter were 
all volunteers for the work, usually seemed to view the maze as if it 
were a psychological test of personal significance to them, and for the 
most part appeared to do their utmost to learn it rapidly and well. It is 
very possible that under these circumstances even information as to 
the scores of others and instructions to try to surpass them would, as 
an additional incentive, be devoid of any measurable effect. 

3. It is very probable that our imposition of a standard upon our 
subjects, our urging them to surpass it, our summarizing of their 
progress thus far after every trial, constituted a set of conditions 
which were emotionally disturbing to many of them. For sixteen 
admitted that emotions were aroused, at least at first or sometimes; and 
since most students probably feel that absence of such disturbance is 
more creditable to them than its presence is, we are inclined to regard 
the evidence of the sixteen who reported emotion as more reliable than 
that of the eight who denied it. Whether these emotions were in 
actual fact intense enough to interfere seriously with the progress of 
the experimental group, we of course cannot say. But that disturb- 
ance from this source may greatly retard learning has long been a 
generally accepted principle. 

4. As has already been pointed out, the competitive incentives 
employed in the present experiment differed from those used in most 
other studies in that there was no working with and in the actual 
physical presence of designated rivals, nor any definite rewards or 
punishments for superiority or inferiority, such as prizes, special 
commendation, the publication of the relative standing of the members 
of the group, and the like. In this respect the experiment is not 
strictly comparable with previous investigations, and it is not sur- 
prising that this important difference in method should bring about a 
corresponding difference in results. 

5. Finally the complexity of our triple criterion of trials, errors 
and time should be mentioned. As the answers to question 2 show, 
only four subjects kept all three measures equally in mind throughout, 
and most of them testified that they gave special attention to some one 
of the three to the relative neglect of the others. It is therefore pos- 
sible that the use of a single simple criterion such as trials alone would 
have yielded better scores, at least according to that one standard. 
However, analysis of the data shows that there was no relation between 
reported attention to any one criterion and score thereon. I.e., those 
who said they gave special attention to, for example, errors did no 
better on errors and no worse on time than did those who reported most 
attention to the latter. This result is, however, quite understandable 
when we reflect that a subject may be led to become particularly con- 
cerned with some one criterion by reason of his poor performance when 
measured thereby. Hence the members of the group which gave 
special attention to, say, errors were in part selected by virtue of their 
inferiority according to that very criterion itself. 

It might also be added that the subjects’ opinions as to whether or 
not they would have done better had they received none of the infor- 
mation given were divided about equally between affirmative and 
negative answers, but that it is naturally rather doubtful whether 
these replies merit any serious consideration. 

In the present experiment, therefore, the employment of the 
incentives of knowing both one’s own score and that of others, together 
with instructions to surpass the latter, was without significant advan- 
tage either for the learning or the retention of a maze. It is not diffi- 
cult, however, to find factors, in connection both with the nature of 
the subjects and with the details of the procedure used, which seem 
capable of accounting for this result; and the differences in method 
between this study and others, particularly when the latter are of the 
competitive type, amply explain the divergence between our findings 
and those of most other investigations within this field. Our results, 
however, do suggest that the value merely of giving knowledge of 
results and defining standards of achievement can easily be exag- 
gerated, and that these are not incentives which can be expected to 
bring about significant improvement in learning without regard to the 
conditions under which they are employed. Hence from this view- 
point at least we are in complete agreement with Ross'* whose work 
has already been described. On the other hand no conclusion to the 
effect that incentives of these types must in general be expected to be 
ineffective is permissible from our limited data. For such devices 
might very well be of great advantage when used with tasks in which 
progress is less evident than it normally is in maze learning, with 
subjects who are less strongly motivated than college students volun- 
teering to serve in a laboratory experiment usually are, and with 
accompanying procedures less likely to be emotionally disturbing to 
those exposed to them. 


SUMMARY 


The aim of the experiment was to determine the effect upon the 
learning and retention of a stylus maze of informing a subject before- 
hand as to the average score in trials, errors and time made by others in 
learning it, and also of telling him after every trial the total number 
of errors he had committed, and the total number of trials and of 
minutes he had himself required thus far. The subjects were fifty 
male college students, divided into two groups of twenty-five each. 
One group, the control, learned the maze without receiving the infor- 
mation described above; the other, the experimental, did receive 
it. After an interval of one week both groups relearned the maze. 
The results were that no reliable differences existed between the two 
groups either in learning or in retention. This outcome, though 
perhaps contrary to theoretical expectation, seemed however quite 
explainable on the basis of such factors as the presence of some 
knowledge of results on the part of the control group, the already high 
motivation of all the subjects, the emotional disturbance which the 
experimental situation probably often aroused, and the absence of 
the definite rewards and punishments for success or failure which have 
been employed in most other studies of the competitive type. It was 
concluded, in substantial agreement with Ross, that the incentives 
used cannot be expected to be universally effective for learning without 
regard to the conditions of their employment. But on the other hand 
their failure to affect the maze learning of college students in no way 
justifies the inference that such incentives would be similarly ineffective 
with other tasks and other subjects. 


THE EFFICACY OF INSTRUCTION IN NOTE MAKING 


STEPHEN M. COREY 
University of Nebraska 


Despite what appears to be the inefficiency of the lecture as a 
method of teaching,¹ its use persists and causes college freshmen 
much difficulty because of their inability to make a satisfactory record 
of what the lecturer says. Most investigations of the factors which 
are responsible for the failure of college students call attention to 
the serious consequences which ensue from this inaptitude.² Jones 
contended that freshmen at the University of Buffalo regarded the 
practice given them in making notes from lectures as being the most 
valuable drill in the entire remediation program.³ 

With considerable evidence of this type before them, the authors of 
books and pamphlets on “How to Study” and “How to Succeed in 
College” have written a great deal on the best methods for making 
notes. Bennett,⁴ Werner,⁵ Bird,⁶ and Crawford⁷ have allocated an 
entire chapter of their respective texts to this problem, and study 
manuals such as those by Wrenn,⁸ and Pressey and Ferguson,⁹ go even 
further in their efforts to impress upon students the value of better 
note making. 


There is a great deal of agreement in the recommendations made 
by these authors. Students are told to pay close attention to the 


1 Corey, Stephen M.: “Learning from Lectures vs. Learning from Readings.” 
J. Educ. Psychol., Vol. XXV, 1934, pp. 459-470. 
2 See Knode, J. C.: Orienting the Student in College. Bur. Pub. Tchrs. Col., 
Columb. Univ., 1930, p. 89. 
Remmers, H. H.: “A diagnostic and remedial study of potentially and actually 
failing students, etc.” Bull. of Purdue Univ., Vol. XXVIII, No. 12, 1928, p. 149. 
Jones, E. S.: “Studies from the Office of Personnel Research.” Univ. of 
Buffalo Studies, Vol. VIII, No. 1, 1930, pp. 39-47. 
Pressey et al.: Research Adventures in University Teaching. Public School 
Pub. Co., 1927, p. 14. 
3 Jones, E. S.: Op. cit., p. 48. 
4 Bennett, M. E.: College and Life. McGraw-Hill Co., Ch. 12, 1933. 
5 Werner, O. H.: Every College Student’s Problems. Silver Burdett & Co., Ch. 9, 
1929. 
6 Bird, Charles: Effective Study Habits. Century Co., Ch. 5, 1931. 
7 Crawford, C. C.: The Technique of Study. Houghton Mifflin Co., Ch. 2, 1928. 
8 Wrenn, C. Gilbert: Practical Study Aids. Stanford Univ. Press, Ch. 4, 1931. 
9 Pressey, L. C. and Ferguson, J. M.: Student’s Guide to Efficient Study. 
Richard R. Smith, 1932. 
lecturer, to make notes in outline form, to get things down accurately, 
to guard against writing down too much, to label their notes carefully, 
to keep them neat, to let the notes be a record of the student’s thoughts 
as well as the lecturer’s statements, to review soon after the notes 
are made, and so on. 

It would seem reasonable to infer from the amount of attention 
given this type of advice that some effort had been made to determine 
whether or not it produces results. Such, however, seems not to 
have been the case. There is rather objective evidence available that 
certain methods of note making are characteristic of better students,¹ 
but the important question as to whether advising students to make 
this type of notes benefits them or not is as yet unanswered. Reports 
have been made of improvement in note making ability which follows 
upon supervised practice,² but the writer has been unable to discover 
any evidence that telling students how to make notes leads to better 
note making. 


THE PROBLEM 


The present study is an attempt, in view of this lack of evidence, 
to see whether students who have been given formal instruction in note 
making actually make better notes than students who have not been 
given such formal instruction. The experimental technique employed 
made use of two equated groups of students. To one of these, the 
experimental group, were given detailed instructions on how to make 
notes from lectures. Both the experimental and control groups then 
listened to a long lecture on which they were to make notes that later 
were evaluated. The method of evaluation was that first employed 
by Thompson,³ namely, allowing students the use of their notes in 
taking an examination over the lecture on which the notes were made. 
If telling students how to make notes causes them to do so more 
intelligently, the experimental group to which this sort of instruction 
had been given should have done appreciably better with the examina- 
tion on which the notes were used. 


1 Pressey, L. C. et al.: Op. cit., pp. 1-10. 
2 See Pressey, L. C. et al.: Ibid., pp. 11-21. 
Jones, E. S.: Op. cit., pp. 47ff. 
Lemon, A. C.: “Studies in Education.” University of Iowa Studies, Vol. 
III, No. 8, 1927, p. 75. 
3 Thompson, Lorin A.: “A report on a note taking experiment at Ohio Wesleyan 
University.” Ohio College Association Bulletin 77, Ohio State University, 
Columbus, Ohio (no date). 


THE METHOD 


The subjects studied were freshmen in the Teachers College at the 
University of Nebraska. The investigation was conducted at a time 
during the semester when attention was directed to the improvement 
of note making ability. Two groups of students, each totalling one 
hundred, had previously been given Form 17 of the Ohio State Psycho- 
logical Examination which made possible preliminary equation with 
respect to total psychological test scores and vocabulary test scores. 
These equating data are given in Table I. 


TABLE I.—EQUATION OF EXPERIMENTAL AND CONTROL GROUPS 

                          Experimental   Control                   Chances in one hundred
                          (N = 100)      (N = 100)    Difference   that the difference
                                                                   is significant
Vocabulary test      Mn     21.50         22.14          .64
                     SD      3.88          4.28          .58 SD            86
Psychological test   Mn     90.20         89.75          .45
                     SD     33.45         34.5          4.81 SD            54

It can be seen from Table I that the groups were statistically 
comparable, although the scores of the control group slightly exceeded 
those of the experimental group with respect to the vocabulary test. 
In as much as the subjects were selected by chance from a relatively 
homogeneous group of first semester college students, the magnitude 
of the groups would practically guarantee the automatic equation 
of other factors.¹ 
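
The last two columns of Table I can be approximately reconstructed from the means and standard deviations. The sketch below assumes that the SD of the difference was computed for independent groups of one hundred each, and that "chances in one hundred" is the normal-curve probability, times one hundred, corresponding to the difference divided by its SD. Both are assumptions, since the text does not describe the computation, and the function names are hypothetical.

```python
import math

def sd_of_difference(sd_a, sd_b, n_a=100, n_b=100):
    """SD of the difference between two independent means (assumed formula)."""
    return math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)

def chances_in_100(diff, sd_diff):
    """Area of the normal curve below diff/sd_diff, times 100 (assumed meaning)."""
    z = diff / sd_diff
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rows = {
    # name: (experimental mean, experimental SD, control mean, control SD)
    "Vocabulary test": (21.50, 3.88, 22.14, 4.28),
    "Psychological test": (90.20, 33.45, 89.75, 34.5),
}

for name, (mean_e, sd_e, mean_c, sd_c) in rows.items():
    sd_diff = sd_of_difference(sd_e, sd_c)
    chances = chances_in_100(abs(mean_e - mean_c), sd_diff)
    print(f"{name}: SD diff = {sd_diff:.2f}, chances in 100 = {chances:.0f}")
```

The recomputed SDs of the difference agree with the printed .58 and 4.81, and the chances come out within about a point of the printed 86 and 54. Small discrepancies of that size would be expected if the original figures were read from a printed table of the normal curve.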

The control group during the first class meetings after the Thanksgiving 
vacation was given a forty-five minute lecture on “The Teachers 
College.” This lecture was written out, memorized, and delivered 
with the manuscript before the lecturer for reference. The students 
were instructed to make notes and were told that three weeks later 
these notes would be returned to them to be used in taking an examina- 
tion covering the lecture. This seemed to provide satisfactory 
motivation. 





1 Corey, Stephen M.: “The Dependence upon Chance Factors in Equating 
Groups.” Amer. J. Psychol., Vol. XLV, pp. 749-752. 

The experimental group was given at this same time a lecture on 
how to make notes. The subject-matter of this lecture consisted of a 
summary of what appeared to the writer to be the consensus of opinion 
of the authors who had written on note making. Attention was 
directed to points such as the following: Making notes in outline 
form; putting down main ideas; attempting to get names, dates, places, 
etc., when given; reacting to the lecture by making notes on personal 
opinions. In addition to this instruction the experimental group was 
assigned a chapter in the textbook dealing with note making which 
they were to outline and hand in. These two measures, the lecture 
on note making and the outline prepared on the textbook chapter 
dealing with the same topic, meant that the experimental group spent 
at least three hours learning about note making. 

After this preliminary instruction the experimental group was 
given the same lecture as had been delivered to the control group. 
The instructions that notes should be made were repeated and the 
students were admonished that these notes would be returned for 
use three weeks later in an examination given over the lecture. 

The examination which was administered to both groups in an 
effort to measure their note making ability was a combination true- 
false, multiple-choice, and completion test of fifty-four items. Its 
reliability, determined by correlating odd against even items and 
stepped up by the Spearman-Brown formula, was +.79 for the experi- 
mental group and +.78 for the control group. These reliabilities 
would appear to be high enough to make group comparisons valid. 
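
The step-up described above can be sketched in a few lines. The sketch assumes the usual Spearman-Brown formula for a test of doubled length, r_full = 2 r_half / (1 + r_half). The odd-even correlations themselves are not reported, so the half-test value used here is simply the one implied by the reported +.79 and is not a figure from the study.

```python
def spearman_brown(r_half: float) -> float:
    """Step up a split-half (odd vs. even) correlation to full-test reliability."""
    return 2.0 * r_half / (1.0 + r_half)


def implied_half_correlation(r_full: float) -> float:
    """Invert the formula: the half-test correlation implied by a stepped-up value."""
    return r_full / (2.0 - r_full)


# Hypothetical illustration: back out the odd-even correlation implied by the
# reported reliability of +.79, then step it up again.
r_half = implied_half_correlation(0.79)
r_full = spearman_brown(r_half)
print(round(r_half, 3), round(r_full, 2))
```

On this reading, an odd-even correlation of roughly .65 would be enough to yield the reported reliability.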

This examination was administered three weeks after the “Teachers
College” lecture had been delivered because it was
intended to measure the value of notes rather than retention. Furthermore,
the examination was administered twice in immediate
succession. During the first administration the students were not
allowed to use their notes, but during the second administration their
notes were allowed them for reference. This last procedure was to
test the claim sometimes made that notes are most valuable while
being made.¹ If that is true, the experimental group, had it benefited
from the instruction in note making, should have done better on the
“Teachers College” lecture test whether or not the notes were used.

After the examinations had been scored, the experimental and
control groups were compared with respect to (1) total scores made on
the examination (a) without the use of notes, and (b) with the use of
notes; and (2) the achievement of students (a) ranking in the lowest
quartile of the Ohio State Psychological Examination, (b) ranking in
the middle two quartiles of the Ohio Test, and (c) ranking in the
highest quartile of the Ohio Test.

1 See Werner, O. H.: Op. cit., Ch. 9.


RESULTS 


Table II represents a summary of the scores made on the examination
covering the lecture, for which examination the notes taken on the
lecture were not used. It can be seen that the difference between the
achievement of the experimental and control groups was not at all
significant. As a matter of fact the control group, that is, the
subjects who were given no formal instruction in note making, was
slightly superior to the experimental group.

TABLE II. SCORES ON EXAMINATION IN WHICH NOTES WERE NOT USED

Table III summarizes the scores on the examination in which the
notes were used. Again the difference between the two groups was
small, but the chances that it was statistically significant were ninety-four
in one hundred. Contrary to expectation, however, the control
group, those students who were given no instruction in how to make
notes, did better on the examination in which their notes were used
than the experimental students.

TABLE III. SCORES ON EXAMINATION IN WHICH NOTES WERE USED

                                                 Chances in one hundred
            Experimental   Control   Difference  that the difference
                                                 is significant
Mean ....      40.46        41.54       1.08
SD ......       4.32         5.32        .68              94
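
The "chances in one hundred" column can be reproduced, at least approximately, from the other two figures in the same row. The sketch below assumes that the ratio of the difference to its SD was referred to the cumulative normal curve. The paper does not describe its method, so this is an inference from the tabled numbers and not a statement of the author's procedure.

```python
from statistics import NormalDist


def chances_in_100(difference: float, sd_difference: float) -> float:
    """Chances in one hundred that the true difference is greater than zero.

    Assumption: difference / SD(difference) is treated as a normal deviate.
    """
    ratio = difference / sd_difference
    return 100.0 * NormalDist().cdf(ratio)


# Table III as reported: difference 1.08, SD of the difference .68.
print(round(chances_in_100(1.08, 0.68)))
```

With the Table III values this gives about 94, in agreement with the tabled figure.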

In view of the equation of the two groups and their size, this result,
even though the superiority of the control group was not statistically
significant, might justify the inference that the instruction in note
making, at least in the amount which was offered in the present study,
was more confusing than beneficial. An alternative conclusion
might be that the test given was not of the sort which would measure
superior note making ability. In answer to this last criticism it
should be noted that the writer had a semi-objective type of test in
mind when he gave the experimental group lectures on how to make
notes. In other words, the lecture on note making was planned with
the end in view of making the notes more valuable for an examination
such as the one used.

Table IV summarizes the effect of instruction in note making upon 
students scoring in different quartiles of the Ohio State Psychological 
Examination. The scores represent the students’ achievement on the 
examination for which they were allowed to refer to their notes. 


TABLE IV. EFFECT OF INSTRUCTION IN NOTE MAKING UPON STUDENTS SCORING
IN DIFFERENT PSYCHOLOGICAL TEST QUARTILES

                                  Lowest quartile      Middle two quartiles   Highest quartile
                                  Exper.    Control    Exper.    Control      Exper.    Control
                                  group     group      group     group        group     group
Mean ...........................  36.34     37.74      40.62     41.92        44.4      44.96
SD .............................   4.24      5.8        4.32      4.5          3.18      2.84
Difference between experimental
  and control groups ...........       1.40                 1.30                   .56
SD difference ..................       1.43                  .88                   .86
Chances in one hundred that
  difference is significant ....        84                   92                    67














The purpose of this analysis of the examination scores in terms of
psychological test quartiles was to test the contention that
superior students benefit more from formal instruction in how to
make notes than do inferior students. Jones, discussing his experiences
with remediation work at the University of Buffalo, called
attention to the fact that the notes of superior students were very
similar to the notes of inferior students because neither was “used
to this art.” He added, however, that the former group responded
“very rapidly to suggestions and the second set of notes shows every
improvement over the first.”¹

The data from Table IV bear out this contention only in the most
indirect manner. In each case the control students achieved scores
which were slightly but not significantly superior, but the superiority
was least marked in the comparison involving students in the highest
psychological quartile.


SUMMARY AND CONCLUSIONS 


It would appear from the evidence presented above that formal 
instruction in note making has very little effect on note making ability 
in so far as this ability is indicated by higher achievement in an 
examination for which the notes are used. This conclusion casts 
some doubt on the practicability of much of the advice which is given 
freshmen students in orientation courses. They are told how to 
make outlines and how to write examinations and how to use the 
library and so on. This type of advice, or more exactly, this type of 
formal instruction, probably has very little effect upon student 
behavior. The writer has discovered in connection with instruction 
on “how to take an examination” that even though students can 
repeat all the suggestions which are made, they may nevertheless 
violate all of them in an actual examination on which an opportunity 
is given to apply this knowledge. 

The only alternative to formal instruction of the type discussed in 
this paper is, of course, actual practice. As was stated in the opening 
paragraphs of this report, there is considerable evidence that practice 
in note making or practice in outlining or practice in writing examinations
can be extremely beneficial if carefully directed. Such, however,
would not seem to be the case for formal instruction without practice. 





1 Jones, E. S.: Op. cit., p. 58.


IQ IN RELATION TO GRADUATION AFTER FAILURE¹


HELEN C. GOODMAN 
Board of Education, New Haven, Conn. 


The elimination of pupils from the secondary schools through fail- 
ure in subject-matter has become an outstanding educational problem. 
To discover whether such failures are due largely to lack of mental 
ability or are dependent upon other factors, the following study in the 
New Haven High School was undertaken. It consists of an investi- 
gation of the existing records of thirteen hundred seventy-three pupils, 
who have failed one or more times in their high school careers, to deter- 
mine whether IQ is the dominant factor in conditioning graduation 
after this kind of failure has occurred. The practical question involved 
is the part that the IQ rating should legitimately play in advice to 
pupils concerning continuing or discontinuing school. 

“Failure” in the high school in which the present study was made
consists of a mark of F (sixty or below) in two major subjects for two 
succeeding marking periods. The total number of cases involved was 
limited to those failures who had attended high school for at least four 
years, the minimum time essential to complete the high school course. 
For comparative purposes, they are divided into two groups, those 
who subsequently graduated, and those who did not. They are 
designated as graduates and non-graduates. 


IQ IN THE TOTAL FAILURE GROUP 


The distribution of the total group of failing pupils according to 
IQ ranges from sixty-five to one hundred thirty IQ with the mean at 
98.25, SD 10.90. The median is 97.63, Q 7.77. Although the range 
is large, the variability shows a concentration of the group around the 
mean. The middle fifty per cent lie between 90.66 and 106.21 IQ. 
Fifteen per cent of the failures have an IQ of one hundred ten or above, 
generally accepted as superior ability, sixty-two per cent have average 
ability or an IQ between ninety and one hundred ten, and twenty-three 
per cent have inferior ability or an IQ below ninety. From a study
of the few available IQ’s of secondary school pupils, it seems probable
that the failures as a group are not far below the general IQ level of
high school pupils in this section of the country, in secondary schools
as heterogeneous in make-up as the school studied has become in recent
years.

1 This study was submitted in partial fulfillment of the requirements for the
degree of Master of Arts in Yale University.
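
The Q reported for the total failure group appears to be the quartile deviation, that is, half the distance between the first and third quartiles. The paper does not define Q, so this reading is an assumption, but it is consistent with the quartile points quoted in the text (90.66 and 106.21). A minimal sketch:

```python
def quartile_deviation(q1: float, q3: float) -> float:
    """Semi-interquartile range: half the spread of the middle fifty per cent."""
    return (q3 - q1) / 2.0


# Quartile points of the total failure group, as quoted in the text.
q = quartile_deviation(90.66, 106.21)
print(q)
```

The result is close to 7.775, which matches the reported Q of 7.77 to within rounding.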


IQ AND COMPARISON OF GRADUATES AND NON-GRADUATES 


The distributions of the graduates and non-graduates according 
to IQ are presented in Fig. 1. The range of scores is almost the same 
for the two groups. One hundred per cent of the non-graduates score 
between IQ’s sixty-five and one hundred twenty-nine. Ninety-nine 
per cent of the graduates score between the same limits. The mean 
for the graduates is 100.5, SD 10.40 and for the non-graduates is 
95.7, SD 10.75. This represents a difference in IQ of 4.80 between
the means, which is statistically a significant difference. The critical
ratio is 8.35. The variability of the two groups is similar. The 
median for the graduates is 100.61, Q 7.50, and for the non-graduates 
it is 95.13, Q 7.82; the difference between the medians is 5.5 IQ points. 
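
The critical ratio quoted above is presumably the difference between the means divided by the standard error of that difference. The text does not spell out the computation, so the formula in this sketch is an assumption. The sizes of the two groups are not given in this passage either: the values 729 and 644 below are hypothetical, chosen only so that they sum to the 1,373 cases and roughly agree with the graduation percentages quoted later in the paper.

```python
from math import sqrt


def critical_ratio(mean_a: float, sd_a: float, n_a: int,
                   mean_b: float, sd_b: float, n_b: int) -> float:
    """Difference between two means divided by the standard error of the difference."""
    se_difference = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / se_difference


# Means and SDs are from the text; the group sizes (729, 644) are hypothetical.
cr = critical_ratio(100.5, 10.40, 729, 95.7, 10.75, 644)
print(round(cr, 2))
```

Under these assumed group sizes the ratio comes out a little above 8, in the neighborhood of the reported 8.35.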

The middle fifty per cent of the graduates score between IQ 
ninety-three and one hundred eight. This is identical with Terman’s 
estimate from Stanford Binet IQ’s of the middle fifty per cent of 
unselected children. The middle fifty per cent of the non-graduates 
score between 88 and 103.6. The interquartile range of the non-graduates
is nearly five points lower in IQ than that of the
graduates.

In spite of the statistically significant differences in central tendency
between the graduate and non-graduate groups, the similarity
between them is highly important for practical educational and
guidance purposes. One method of indicating similarity is by means
of a measure of overlapping. When this measure is applied, it is
found that thirty-four per cent of the non-graduates reach or exceed
the graduate IQ median, fifteen per cent reach or exceed the seventy-fifth
percentile, and four per cent reach or exceed the ninetieth percentile.
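
The measure of overlapping can be sketched as the percentage of one group that reaches or exceeds a chosen point in the other group's distribution. The individual IQ's are not reproduced in the paper, so the scores below are hypothetical toy values used only to show the computation.

```python
from statistics import median


def percent_reaching(scores: list[float], threshold: float) -> float:
    """Percentage of scores that reach or exceed the threshold."""
    hits = sum(1 for s in scores if s >= threshold)
    return 100.0 * hits / len(scores)


# Hypothetical toy data, NOT the study's records.
graduates = [92, 96, 99, 101, 104, 108, 115]
non_graduates = [84, 88, 91, 95, 97, 102, 106, 110]

graduate_median = median(graduates)
overlap = percent_reaching(non_graduates, graduate_median)
print(graduate_median, overlap)
```

The same function serves for the seventy-fifth and ninetieth percentile comparisons by substituting the corresponding point of the graduate distribution for the median.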

The relationship between the graduates and non-graduates at 
each IQ level is shown in Fig. 2. Below one hundred IQ there are 
more non-graduates than graduates at each IQ level, while above 
one hundred IQ there are more graduates than non-graduates at 
each IQ level. The largest difference is eight per cent. The differences
regularly persist, but again the groups when viewed thus in
sub-groups show similarity rather than dissimilarity in respect to
their brightness as measured by the IQ.

For further consideration, the graduates and non-graduates were
divided into the three generally accepted mental ability classifications
of superior, IQ one hundred ten and above; average, IQ between
ninety and one hundred ten; and inferior, IQ below ninety. In the
average group, which contains the largest number of cases, there are
five per cent more graduates than non-graduates. In the superior
group, there are eight per cent more graduates than non-graduates.
In the inferior group there are thirteen per cent fewer graduates than
non-graduates. The similarity in the number of graduates and non-graduates
within each IQ grouping is again more striking than the
difference.

Although there are reliable statistical differences in central tendency
between the graduates and non-graduates in the failure group,
it is obviously the similarities which are of practical importance. For
administration and guidance purposes, it seems valid to conclude
that the determining influence of IQ is relatively small as far as graduation
or non-graduation after failure is concerned.


IQ AND CHRONOLOGICAL AGE 


Chronological age as a factor in school adjustment needs to be 
considered in relationship to grade placement. The graduates and 
non-graduates are therefore first compared according to age and grade. 
In order to place the chronological ages of the failures in relation 
to the high school group in which they are found, they have been 
compared with the total high school enrollment as found on the most
recent age-grade table. The median chronological age of the graduates
and of the total high school group is approximately the same for
grades nine, ten, and twelve, while in grade eleven it is three months
higher. In all grades the non-graduate median is about three months 
higher than the city median. While this shows a small statistically 
reliable difference, for purposes of educational differentiation a 
variation from the norm of three months in chronological age is 
generally of little importance. 

A comparison is made of the graduates and non-graduates by 
grades according to mean chronological age. In grade nine there is a 
difference of 5.9 months between the means, critical ratio 2.15; in 
grade ten 5.9 months, critical ratio 4.73; in grade eleven 5.7 months, 
critical ratio 2.73; and in grade twelve .1 months, critical ratio .35. 
These differences are statistically significant only for grade ten. 

In order to ascertain the relationship between chronological age
and IQ for the graduates and non-graduates, correlations were figured
for the two groups at each grade level, recognizing at the same time
that IQ is already the ratio of mental to chronological age. There
is a similar negative correlation between IQ and chronological age
for both the graduates and non-graduates, r = -.167 ± .024 for the
former, r = -.109 ± .026 for the latter. Coefficients for boys and
girls separately have also been computed. In seven cases out of
seven, the girls show a higher negative correlation than the boys.
This holds good for both graduates and non-graduates, and for all
grades. In this case, as is usual, the relation of underageness to
brightness is higher for girls than for boys.
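
The ± figures attached to the coefficients above read like probable errors of r. Assuming the classical formula PE = .6745 (1 - r²) / √N (the paper does not say which formula was used), the sketch below computes the probable error and, in reverse, the number of cases implied by a reported coefficient and its probable error. Because the published values are rounded to three decimals, the implied N is only approximate.

```python
from math import sqrt


def probable_error_r(r: float, n: float) -> float:
    """Classical probable error of a product-moment correlation."""
    return 0.6745 * (1.0 - r * r) / sqrt(n)


def implied_n(r: float, pe: float) -> float:
    """Number of cases implied by a coefficient and its probable error."""
    return (0.6745 * (1.0 - r * r) / pe) ** 2


# Coefficients and ± values as reported for graduates and non-graduates.
print(round(implied_n(-0.167, 0.024)), round(implied_n(-0.109, 0.026)))
```

The two implied group sizes come to roughly seven hundred and fifty and six hundred and sixty, which is of the same order as the 1,373 cases in the total failure group.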

Since the graduates and non-graduates are an average group as
regards chronological age in relation to the high school population
in which they are found, showing little or no significant differences in
central tendency, and presenting similar correlations between age
and IQ, it seems valid to conclude that chronological age is not an
important factor in conditioning graduation after failure.


IQ AND SEX DIFFERENCES 


In the total enrollment in this high school for the years 1927 to 
1932, sixty per cent are boys and forty per cent are girls, while in the 
failure group, seventy-eight per cent are boys and twenty-two per cent 
are girls. In the graduate group, seventy-seven per cent are boys and 
twenty-three per cent are girls. In the non-graduate group, eighty 
per cent are boys and twenty per cent are girls. Although there are 
many more boys in the total failure group than the ratio of boys to 
girls in the high school would lead one to expect, there is little apparent 
difference in the proportion of boys and girls in the failure group who 
graduate and those who do not. 

The mean IQ of the graduates and non-graduates according to 
sex is compared as follows. The mean for the boys in the graduate 
group is 101.80, SD 10.0. The mean for the boys in the non-graduate
group is 96.68, SD 10.60. This represents a difference in IQ of 5.12 
between the means, which is a statistically significant difference. The 
critical ratio is 8.15. The mean for the girls in the graduate group 
is 96.12, SD 10.48. The mean for the girls in the non-graduate group 
is 91.84, SD 10.80. This represents a difference in IQ of 4.28 between
the means, which also is a significant difference. The critical ratio 
is 3.43. The variability of the two groups is similar. 

A comparison of the mean IQ of the boys and girls within the 
graduate and non-graduate groups reveals a statistically significant 
difference in mean IQ between the boys and girls in favor of the boys 
in all groups. In the total failure group, the difference in mean IQ 
is 5.08. The critical ratio is 7.13. In the graduate group, the differ- 
ence is 5.68. The critical ratio is 6.28. In the non-graduate group, 
the difference is 4.84. The critical ratio is 4.53.

In comparing the per cent of graduates and non-graduates at 
each IQ level for boys and girls, there is a very noticeable similarity 
in the number of graduate and non-graduate boys, and graduate and 
non-graduate girls, at each IQ level. This similarity is much larger 
than the difference. However, among the boys, there are more 
non-graduates than graduates at each IQ level below one hundred IQ, 
while at one hundred and above there are more graduates than non-graduates.
Among the girls, there are more non-graduates than
graduates at each IQ level below ninety IQ, while at ninety and above
there are more graduates than non-graduates. In the superior
group, ten per cent more boys graduate than fail to graduate, while
among the girls, three per cent more graduate. In the average
group, four per cent more boys graduate than fail to graduate, while
among the girls twelve per cent more graduate. In the inferior
group, thirteen per cent fewer boys graduate than fail to graduate,
while among the girls fourteen per cent fewer graduate. More girls of
somewhat inferior intelligence graduate after failure than boys of
like intelligence.

Although, as has been indicated, there is a large preponderance
of boys over girls in the total failure group, and their IQ as a group 
has been shown to be superior, fifty-seven per cent of the girls graduate 
after failure, while only fifty-two per cent of the boys graduate. This 
cannot be ascribed to any fundamental sex differences in mental 
ability, since all investigations have shown this to be negligible. But 
girls are regularly better students than boys of the same level of ability. 
In this failure group, as in schools generally, there is adequate evidence 
to show that the factor of being a girl is an advantage in reference to 
school progress. 


IQ AND COURSE OF STUDY 


Over half of the failures were enrolled in the college course, a quarter 
were in the academic course, and a fifth in the manual courses. A
comparison of the means and interquartile range of the graduates 
and non-graduates indicates that choice of course itself is selective 
to some extent with regard to IQ. However, there is a great deal of 
overlapping of IQ between the courses. The interquartile range of 
the failures in the college course is from 92.23 to 110.23; in the academic 
course from 88.08 to 104.42; in the boys’ manual course from 84.66
to 101.50; and in the girls’ manual course from 80.0 to 97.57. The 
better pupils in the academic and manual courses are superior intel- 
lectually to the poorer pupils in the college course. Since it is the 
general opinion that the classical, scientific, and academic courses 
attract the more able students, except where family pressure and 
social conditions are interposed, and the vocational courses the least 
able, this factor of overlapping is of great consequence from the 
administrative and guidance aspect in education.

Comparing the IQ’s of the graduates and non-graduates within
each course we find that in the college course the mean IQ for the 
graduates is 103.12, SD 9.80, for the non-graduates 99.12, SD 10.48. 
The difference in mean IQ is 4.00, which is a significant difference. 
The critical ratio is 5.24. In the academic course, the mean IQ 
for the graduates is 97.36, SD 10.04, for the non-graduates 94.72, 
SD 9.52. The difference in mean IQ is 2.64, which does not represent 
a significant difference for groups of this size. The critical ratio is 
2.54. In the boys’ manual course, the mean IQ for the graduates is 
95.40, SD 9.16, for the non-graduates 92.28, SD 10.20. The difference 
in mean IQ is 3.12. In the girls’ manual course the mean IQ for the 
graduates is 91.68, SD 9.52, for the non-graduates 87.76, SD 10.28. 
The difference in mean IQ is 3.92. In neither manual course is the
difference in mean IQ significant. The critical ratios are 1.93 and 
1.86 respectively. 

Since there is a definitely significant statistical difference in mean 
IQ between the graduates and non-graduates only in the college 
course, choice of course may be eliminated as a decided factor in 
graduation after failure except for the college course. The question 
may well be raised here as to whether suitable placement in other 
than the college preparatory course of individuals now tending to 
fail in that course might not reduce the total number of failures. It 
is in this respect that family pressure and social conditions make the 
situation most difficult. 


IQ AND SUBJECTS FAILED 


This part of the discussion has been limited to a study of the 
relationship of IQ to marks in English and mathematics, since these 
two subjects were in the curriculum of most of the failure group, and 
since the content of these two subjects is mutually as unrelated as 
any. The conclusion may be assumed as applicable to other school 
subjects for this group of pupils. Pintner¹ says there seems to be no
marked tendency for the correlations between IQ and marks to be 
higher in any one of the usual academic subjects than in any other. 

The correlations between IQ and the two subjects are shown in the
table below. These correlations are much lower than the literature on the
relationship between IQ and teachers’ marks in these subjects shows
generally for the general high school population. Studies presenting
correlations between IQ and marks in English vary usually for such
groups from .42 to .67. Studies presenting correlations between IQ
and marks in mathematics vary from .33 to .80. It seems probable
that the very low coefficients of correlation between marks and IQ’s for
the students who have failed are an indication that these particular
boys and girls are the most difficult for the teachers to appraise
adequately, or at any rate they are that part of the school population
for whom other factors than IQ are most potent.

Graduates .......... IQ and marks in English        +.123 ± .025
Non-graduates ...... IQ and marks in English        +.104 ± .027
Graduates .......... IQ and marks in mathematics    +.118 ± .030
Non-graduates ...... IQ and marks in mathematics    +.043 ± .032

1 Pintner, R.: “Intelligence Testing, Methods and Results.” 1931, p. 285.

Studying the percentage of graduates and non-graduates receiving 
marks from A to F in English and mathematics, we find that about 
twelve per cent more non-graduates than graduates were marked F 
in both subjects. Approximately the same per cent were marked D 
in both subjects. More graduates than non-graduates were marked 
A, B, and C in both subjects. Although there are these differences, 
there is also meaning for educational purposes in the similarity in 
respect to number of graduates and non-graduates receiving each mark. 

Since it is to be expected that marks should have some relation 
to the quality of mental ability, the graduates and non-graduates 
were divided according to the IQ classifications of superior, average, 
and inferior, and the sub-groups considered separately in respect to 
marks in English and in mathematics. To illustrate, Fig. 3 shows 
the percentage at each IQ classification for English. In both the 
graduate and non-graduate groups, more F’s in English were received 
by the inferior group than by the average, and more by the average 
than by the superior. In mathematics, on the other hand, an approxi- 
mately equal percentage of F’s was received by the inferior, average, 
and superior groups, for both graduates and non-graduates. Marks 
in English seem to have more relation to IQ than marks in mathe- 
matics. This is in accord with the general findings in the literature. 
A comparison of the number of graduates and non-graduates receiving 
each mark in each ability classification reveals for the most part an 
outstanding similarity. This likeness is so persistent that it leads 
to the conclusion that many of the main determinants leading to 
passing or failing marks are not to be sought in the IQ.

Teachers’ marks determine whether a pupil is to be rated a success
or failure, whether he shall graduate or not graduate. Miller¹ says
that “tests are probably a more reliable indication of what a pupil’s
achievement in school should be than are his marks an indication
of what his achievement has been.” Since in the failure group marks
and the IQ have shown lower correlations than were to be expected,
an effort needs to be made to isolate those factors which are the cause
of the low correlation. If, as Thorndike² says, marks were more
measurements than opinions, IQ could be made a more powerful
factor in conditioning graduation after failure than it has been shown
to be.


IQ AND EMPLOYMENT 


In the total failure group, thirty per cent worked outside of school.
Thirty per cent of those who succeeded in graduating were employed,
and thirty per cent of those who did not graduate were employed. Among
the boys, thirty-five per cent of the graduates worked, and thirty-two
per cent of the non-graduates. Among the girls, fifteen per cent of
the graduates worked and twenty-one per cent of the non-graduates.
There is no difference between the graduates and non-graduates in
the percentage employed outside of school.

1 Miller, W. S.: “The Administrative Use of Intelligence Tests in the High
School.” Twenty-First Yearbook, Nat. Soc. for the Study of Educ., 1922.

2 Thorndike, E. L.: “Measurement in Education.” Twenty-First Yearbook,
Nat. Soc. for the Study of Educ., Part I, 1922.

The mean IQ of the graduates who were employed is 96.48, SD 
10.93. The mean IQ of the non-graduates who were employed is 
95.43, SD 10.48. The difference in IQ between the means is 1.05,
which is a significant difference. The critical ratio is 4.88. While 
the mean for the employed non-graduates is the same as the mean 
for the non-graduates as a whole, the mean of the employed graduates 
is four points lower than that of the graduates as a whole. The 
employed graduates and non-graduates are more alike in the central 
tendency of their distributions than the graduates and non-graduates 
as a whole. 

The percentage at each IQ level of graduates and non-graduates 
who are employed daily is presented. In the average group, IQ 
ninety to one hundred ten, there are sixty-seven per cent graduates, 
and sixty-two per cent non-graduates, a difference of five per cent. 
In the superior group, IQ above one hundred ten, there are twice 
as many graduates as non-graduates; in the inferior group there are 
half as many graduates as non-graduates. Perhaps one may say 
that if a pupil who has failed has at least average mental capacity, 
there is an even chance that employment will have no effect on his
ability to graduate. If he has inferior mental ability, the additional 
burden of outside work is a disadvantage. If he has superior ability, 
employment in itself is no handicap. 

A like number of graduates and non-graduates are employed 
and in central tendency they are more similar in IQ than the graduate 
and non-graduate groups as a whole. However, pupils of high IQ 
who are employed outside of school seem to have a better chance to 
graduate after failure than pupils of low IQ. 


CONCLUSIONS 


Pupils with IQ’s from sixty-eight to one hundred thirty-two
have demonstrated that they fail in at least two subjects and after- 
wards graduate from high school. A similar number of pupils with 
the same range of IQ’s fail to graduate under like circumstances. 


Certain similarities and certain contrasts have been found between
these groups.

1. The IQ is a factor definitely related to graduating ability in the
group studied. This is shown by the significant statistical difference
in central tendency between the failures who graduated and those
who did not. There are also definite similarities between the two 
groups which have as much or greater importance for guidance. 
These similarities indicate that the determining influence of IQ on 
graduation after failure is relatively small as compared with other 
factors which, though still largely unclassified, must be powerful. 

2. The graduates and non-graduates present low negative correla- 
tions between IQ and chronological age. Combined, they are average 
as to chronological age in relation to the general high school population 
and the school in which they are found. Compared with one another, 
they show little or no significant difference in central tendency. 
There is evidence that in this group chronological age does not affect 
graduation after failure. 

3. Although there is a much larger number of boys than girls in 
the total failure group, and their IQ as a group has been shown to be 
superior, a larger per cent of girls than boys graduate after failure. 
It is evident that the factor of being a girl is an advantage in reference 
to the situation of graduation after failure. 

4. Choice of course itself is selective in reference to IQ, except 
probably where family pressure and social conditions are interposed. 
This selectivity has been found similar for both graduates and non- 
graduates. There is a significant statistical difference in mean IQ 
between the graduates and non-graduates in the college course, but 
not in the other courses. This affords evidence that IQ is a more 
determinant factor in graduation when the pupil’s ability is being 
rated for college entrance. 

5. Similar numbers of pupils in both the graduate and non-graduate 
groups who have superior, average, and inferior IQ’s receive failing 
marks. The correlations between marks and IQ are low for both 
groups. It seems probable that the factors which cause the dis- 
crepancy between marks and IQ are in general the determinants of 
graduation after failure. 

6. A like number of graduates and non-graduates are employed 
outside of school. In central tendency the IQ’s of the working chil-
dren are more similar than those of the graduates and non-graduates
as a whole. There is evidence that pupils of high IQ who work outside
of school have a better chance to graduate after failure than working
pupils of low IQ.


PRODUCT-MOMENT CORRELATION AS A RESEARCH
TECHNIQUE IN EDUCATION 


PAUL HANLY FURFEY AND JOSEPH F. DALY 
The Catholic University of America 


Product-moment correlation is quite widely accepted as an ade-
quate measure of “closeness of relationship.” Those who have studied
the derivation of the formula will realize, however, that this is true 
only under certain known conditions. When these conditions are 
fulfilled, r has a definite meaning. When they are absent, r cannot be 
interpreted in any useful way. 

The present paper represents an attempt to discover whether the 
conditions just mentioned are usually fulfilled in current educational 
research. To this end the latest available years of five periodicals 
were examined, that is to say, the Elementary School Journal (Sep- 
tember, 1932-June, 1933), the Journal of Experimental Education 
(1933), the Journal of Educational Psychology (1933), the School Review 
(1932), and the department headed “Educational Research and
Statistics” from School and Society (July, 1932-June, 1933). The
results of this analysis are shown in Table I. Of the two hundred
sixty-eight articles included in this brief survey, five were reviews of
the literature quoting r, sixteen used r in the derivation of formulae,
and sixty-three used r in the analysis of data. The present paper is
concerned with the last group of sixty-three. It is interesting to
note in passing that eighty-four, or almost one third of the total
number of articles, used r in some way.


TABLE I.—THE USE OF r IN CURRENT EDUCATIONAL PERIODICALS

We now proceed to inquire how generally this wide use of correla- 
tion is justified. Before doing so we must inquire what antecedent 
conditions must be fulfilled before r can be interpreted in a given 
manner. The facts are summarized in Table II.¹ Limits of space
do not permit a justification of the statements contained therein, but
they are all familiar to mathematical statisticians.


TABLE II.—INTERPRETATION OF THE PRODUCT-MOMENT CORRELATION
COEFFICIENT UNDER SPECIFIED CIRCUMSTANCES

CIRCUMSTANCES                            INTERPRETATION

1. The sample is rigorously normal.      r measures relationship in the said sample
                                         with entire adequacy by determining all
                                         the array moments.

2. The sample is drawn from a normal     r measures relationship in the said universe
   universe.                             by determining all the array moments
                                         within limits given by known standard
                                         errors.

3. The sample is rigorously linear.      r measures relationship in the said sample
                                         by determining the first moments of all
                                         the arrays and an average second moment
                                         about each regression line.

4. The sample is drawn from a linear     r measures relationship in the said universe
   universe.                             by determining the first moment of all the
                                         arrays and an average second moment
                                         about each regression line within limits
                                         given by known standard errors.

5. The sample is approximately linear    r may be interpreted as in (1) to (4) above,
   or normal, or is drawn from an ap-    but with an indefinite degree of error.
   proximately linear or normal uni-
   verse.

6. The normality or linearity of the     r has no definite interpretation.
   sample has not been determined.

¹ For a fuller treatment see Furfey, Paul Hanly and Daly, Joseph F.: The
interpretation of the product-moment correlation coefficient. Washington, Catholic
Education Press, 1934, pp. 57. (The Catholic University of America Educational
Research Monographs, Vol. VIII, No. 4.)

The first four cases may be handled together. As one may see,
these four cases demand perfection. Either the sample itself or the
universe from which it is drawn must be normal or linear in a
mathematically exact sense. Every experienced statistician will
recognize that perfectly normal or perfectly linear bivariate distri-
butions never, or almost never, occur in practice. This fact elimi-
nates cases one and three from the realm of practical usefulness.

FIG. 1.—Distribution of Absolute Magnitudes of 1354 Correlation Coefficients.

There remains the possibility that, although the sample being
studied is non-normal or non-linear, the universe from which it is
drawn may be normal or linear. This is the possibility considered
in cases two and four. It is well to remember that at this point we
enter the realm of pure theory. No one ever examines a statistical 
universe. No one ever examined, for example, the bivariate universe 
embracing the Stanford-Binet and Otis scores of all twelve-year-old 
children. At best we have occasional samples from such universes. 
To certain enthusiasts, inspection of such samples indicates the normal 
or linear nature of the universe. It may conservatively be stated,
however, that the normality or linearity of no bivariate universe has
ever been scientifically proved. What we know about the normality
of the few univariate universes which have been rather casually 
studied, would certainly incline one to suspect that normal bivariate 
universes, at least, are extremely rare if not non-existent. 

It is worth noting that even in the impossible case that all the 
thirteen hundred fifty-four correlations occurring in the articles 
studied were drawn from normal universes, r would still lack a very 
definite meaning. The accompanying Fig. 1 shows the distribu- 
tion of the absolute values of the r’s. The median value was 0.44. 
Table III shows that the median number of cases used is just under
ninety-one. Now a correlation of 0.44 based on ninety-one cases has
a standard error of about 0.08 and a probable error of just under
0.05. In other words the median correlation used in this sample of
educational research has merely an even chance of indicating a “true”
correlation of between 0.49 and 0.39. Even in this ideal case, then,
correlation leaves something to be desired.


TABLE III.—DISTRIBUTION OF N’S USED IN THE CALCULATION OF THIRTEEN
HUNDRED FIFTY-FOUR CORRELATION COEFFICIENTS

N                           FREQUENCY
[illegible] ...............    11
[illegible] ...............     6
[illegible] ...............    46
[illegible] ...............    71
[illegible] ...............    93
[illegible] ...............   154
[illegible] ...............   206
[illegible] ...............   218
[illegible] ...............   224
[illegible] ...............   117
[illegible] ...............    21
[illegible] ...............    74
[illegible] ...............    44
[illegible] ...............    48
[illegible] ...............    21
Total .....................  1354
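
The figures quoted for the median correlation can be checked with a short calculation. The sketch below assumes the classical large-sample formula for the standard error of r, (1 - r²)/√N, and the conventional factor 0.6745 for converting a standard error into a probable error; the function names are illustrative and are not taken from the paper. With r = 0.44 and N = 91 it gives a standard error near 0.085 and a probable error near 0.057, of the same order as (though slightly above) the rounded values quoted in the text.

```python
import math

def r_standard_error(r, n):
    """Classical large-sample standard error of r (assumed formula): (1 - r^2) / sqrt(n)."""
    return (1.0 - r * r) / math.sqrt(n)

def probable_error(standard_error):
    """Probable error: the half-width that a normal deviate exceeds half the time."""
    return 0.6745 * standard_error

# Median |r| and median N reported for the 1354 coefficients surveyed.
se = r_standard_error(0.44, 91)
pe = probable_error(se)
print(f"standard error = {se:.3f}, probable error = {pe:.3f}")
print(f"even-chance interval: {0.44 - pe:.2f} to {0.44 + pe:.2f}")
```

The interval printed is the "even chance" band discussed above: half of all samples from a normal universe would be expected to give an r outside it.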

Case five includes those instances in which the conditions of cases 
one to four are only approximately fulfilled. At this point it is essen- 
tial to note the following proposition: When a numerical result is 
derived from approximate data and when the degree of approximation 
of the latter is unknown, then the numerical result is not really quanti- 
tative. To illustrate—a chemist has in his laboratory a bottle of 
NaOH which he knows is approximately decinormal. He does not
know, quantitatively, how nearly decinormal it is. He titrates an 
acid of unknown strength against this base and obtains the result that 
the acid is 0.1318. Of course no good chemist would accept this result 
as truly quantitative; for it is based on an unknown approximation. 
Just so, when a correlation is based on a regression which is only 
approximately linear and when the degree of that approximation is 
unknown, then the obtained coefficient is quantitative merely in 
appearance. Actually, it is no more quantitative than a verbal 
statement because it depends entirely on the verbal statement that the 
distribution was approximately linear. 

Attempts have been made to measure linearity quantitatively. 
The only technique which need be seriously considered is that based on 
the comparison of r with eta, the correlation ratio. If r is nearly equal
to eta, then the distribution is considered linear and r is accepted as a 
measure of relationship. If the two coefficients are not nearly equal, 
then the distribution is considered non-linear and r is discarded in 
favor of eta. It is hard to see how a product-moment enthusiast can 
bring himself to use this technique, for it implies that eta is superior 
to r. While this is possibly true, it is a surprising admission! 

Yet even this technique does not solve the difficulty. The com- 
parison with the chemist was really too favorable to the use of r. For 
when a chemist knows the titer of a reagent within, say, one per cent, 
then he knows that the error of his titration from that source will also 
be within one per cent. But correlation suffers from a more funda- 
mental difficulty. Even when we have a good measure of the depar- 
ture from normality or linearity of the bivariate distribution, we do 
not know how that affects r. To be concrete, who can say which 
denotes the closer relationship—an r of 0.50 with a zeta equal to twice 
its standard error or an r of 0.60 with a zeta equal to half its standard
error? The only answer to this question would be to use eta or some
other measure of correlation. It is not a question which can be 
answered in terms of r alone. 
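
The comparison of r with eta described above can be illustrated numerically. The following is a minimal sketch and not the authors' procedure: it assumes the usual definition of the correlation ratio (eta of y on x, computed from the means of the y-arrays at each distinct x) and takes zeta to be eta² - r², the quantity examined in Blakeman's criterion. The data and the function names are hypothetical.

```python
import math
from collections import defaultdict

def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_ratio(xs, ys):
    """Eta of y on x: sqrt(between-array sum of squares / total sum of squares)."""
    arrays = defaultdict(list)
    for x, y in zip(xs, ys):
        arrays[x].append(y)
    my = sum(ys) / len(ys)
    between = sum(len(a) * (sum(a) / len(a) - my) ** 2 for a in arrays.values())
    total = sum((y - my) ** 2 for y in ys)
    return math.sqrt(between / total)

# Hypothetical data whose array means rise and then fall (2, 5, 2).
xs = [1, 1, 2, 2, 3, 3]
ys = [1, 3, 4, 6, 1, 3]

r = pearson_r(xs, ys)
eta = correlation_ratio(xs, ys)
zeta = eta ** 2 - r ** 2  # assumed definition of zeta
print(f"r = {r:.3f}, eta = {eta:.3f}, zeta = {zeta:.3f}")
```

In this made-up case r is zero while eta is about 0.82, so r alone would report no relationship where a marked non-linear one exists; zeta measures that gap but does not say how to correct r.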

All writers on correlation—even the very uncritical authors of 
elementary textbooks—agree that r must not be used without at least 
a previous determination of linearity. When this precaution is 
ignored—and this is case six—no definite conclusions can be drawn, 
except from the minimum value of eta thus obtained.

In spite of the entirely unjustifiable nature of this practice, it seems 
to be the usual procedure of American educational research workers, at
least if one judges from the published data. Of the sixty-three articles
in which product-moment correlation was used, only seven gave any 
information about the normality or linearity of their distributions, or 
any data from which such information could be deduced. In two 
instances the Blakeman test was used—a procedure essentially the 
same as the use of zeta. In another, two modified scatter diagrams 
were published. In two articles inspectional methods were employed, 
one author remarking the apparent normality of the marginal totals 
and another stating that the bivariate distribution seemed to be 
normal. In one instance artificially normal marginal totals were set 
up, while in the seventh case the author stated that his regressions were 
not linear, after which he nevertheless proceeded to use r. Among all 
these only the publication of the scatter diagram and the use of the 
Blakeman test need be considered seriously, and the Blakeman test 
proves the possibility rather than the existence of linearity. 

As for the remaining sixty articles, their authors have left them- 
selves open to the suspicion of having employed the correlation tech- 
nique in a way which is meaningless, if not positively misleading. 

A final question is pertinent. If r is such a poor instrument for 
measuring relationship, what substitute may we propose? Unfor- 
tunately this question has no simple answer. As a practical procedure 
we might suggest that whenever r is given certain additional informa-
tion be furnished along with it. If writers would give not only r, but
both etas and zetas, and in addition, perhaps the regression, scedastic,
clitic and kurtic curves, or at least the first two of these curves— 
then many of the unjustified conclusions now based on r would not 
be drawn! 


SUMMARY 


An examination of articles published in recent issues of five edu- 
cational periodicals shows that product-moment correlation is being 
used with little regard to the fulfillment of the necessary antecedent 
conditions. 


WISHART’S EXACT FORMULA FOR THE STANDARD
ERROR OF THE PRODUCT-MOMENT TETRAD 
VERSUS AN APPROXIMATION FORMULA 


EDWARD E. CURETON 
Alabama Polytechnic Institute 


Wishart, in a paper in Biometrika,¹ gives a table of moments of the
simultaneous sampling distribution of variances and covariances. 
Certain of the results from this table are of importance in the derivation 
of approximate formulas for the sampling variances, sampling standard 
deviations (standard errors), and sampling covariances of functions 
such as the tetrad, the correlation corrected for attenuation, etc. 
These results are given in terms of the population variances and 
correlations. They may be stated with equal accuracy and greater 
simplicity in terms of the variances and covariances. They all rest 
on the assumption that the simultaneous distribution of the variates 
in the population is a multivariate normal distribution. 

The most valuable of Wishart’s results for our purposes may be 
stated as follows: 


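$$
\begin{aligned}
N\sigma^2_{s_{11}} &= 2\sigma_{11}^2 \\
N\sigma_{s_{11}s_{22}} &= 2\sigma_{12}^2 \\
N\sigma^2_{s_{12}} &= \sigma_{11}\sigma_{22} + \sigma_{12}^2 \\
N\sigma_{s_{11}s_{12}} &= 2\sigma_{11}\sigma_{12} \\
N\sigma_{s_{11}s_{23}} &= 2\sigma_{12}\sigma_{13} \\
N\sigma_{s_{12}s_{13}} &= \sigma_{11}\sigma_{23} + \sigma_{12}\sigma_{13} \\
N\sigma_{s_{12}s_{34}} &= \sigma_{13}\sigma_{24} + \sigma_{14}\sigma_{23}
\end{aligned}
\tag{1}
$$


where $\sigma^2$ with subscripts other than numerical is a sampling variance,
$\sigma$ with two literal subscripts is a sampling covariance,
$s$ with numerical subscripts is a variance or covariance from the
sample according as the subscripts are the same or different,
and $\sigma$ with numerical subscripts is a variance or covariance from the
population according as the subscripts are the same or different.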
As given by Wishart, the left-hand member of each of these 
equations contained the correct factor (N — 1) instead of the only 
approximately correct factor N. It is obvious that as N increases, the 
relative importance of this discrepancy decreases. 
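
All seven relations in (1) are special cases of a single rule for multivariate normal populations, N·cov(s_ij, s_kl) = σ_ik σ_jl + σ_il σ_jk. That general rule is a standard result supplied here as an aid; it is not quoted from the paper. The sketch below applies it to a hypothetical population matrix and checks it against each line of (1); the function name and the numbers in the matrix are illustrative only.

```python
def n_times_cov(sigma, i, j, k, l):
    """N * cov(s_ij, s_kl) for a multivariate normal population.

    sigma is the population variance-covariance matrix (list of lists);
    subscripts are 1-based to match the notation of Equations (1).
    """
    def s(a, b):
        return sigma[a - 1][b - 1]
    return s(i, k) * s(j, l) + s(i, l) * s(j, k)

# Hypothetical symmetric population matrix for four variates.
sigma = [
    [2.0, 0.5, 0.3, 0.1],
    [0.5, 1.5, 0.4, 0.2],
    [0.3, 0.4, 1.0, 0.6],
    [0.1, 0.2, 0.6, 3.0],
]

# Each pair: (left-hand member by the general rule, right-hand member of (1)).
checks = [
    (n_times_cov(sigma, 1, 1, 1, 1), 2 * sigma[0][0] ** 2),
    (n_times_cov(sigma, 1, 1, 2, 2), 2 * sigma[0][1] ** 2),
    (n_times_cov(sigma, 1, 2, 1, 2), sigma[0][0] * sigma[1][1] + sigma[0][1] ** 2),
    (n_times_cov(sigma, 1, 1, 1, 2), 2 * sigma[0][0] * sigma[0][1]),
    (n_times_cov(sigma, 1, 1, 2, 3), 2 * sigma[0][1] * sigma[0][2]),
    (n_times_cov(sigma, 1, 2, 1, 3), sigma[0][0] * sigma[1][2] + sigma[0][1] * sigma[0][2]),
    (n_times_cov(sigma, 1, 2, 3, 4), sigma[0][2] * sigma[1][3] + sigma[0][3] * sigma[1][2]),
]
for general, special in checks:
    print(f"{general:.4f}  {special:.4f}")
```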





¹ Wishart, J.: “The generalized product moment distribution in samples from
a normal multivariate population.” Biometrika, Vol. XX, 1928, pp. 33-52.

The usual method of deriving the sampling variance of a function
of the variances and covariances is to take the differentials or logarith-
mic differentials of both sides of the equation, square, sum over the
theoretically infinite population of successive samples, and divide
by this theoretically infinite number. In performing the last two of 
these operations, we substitute for each squared differential the 
corresponding sampling variance, and for each product of two differ- 
entials the corresponding sampling covariance. 

In a paper in the British Journal of Psychology,¹ Wishart voices a
strong objection to this procedure as used by Kelley, Pearson and 
Moul, and Spearman and Holzinger in obtaining approximations to the 
standard error of the tetrad. He says, 

“Hitherto the standard deviation (standard error of the tetrad) has 
been known only approximately in the form of one or two terms of an 
expansion proceeding in inverse powers of N, the number in the 
sample. But an exact formula is much to be preferred. If N is not 
really large the first term or two in an expansion will not be correct 
enough. A more serious objection is that in particular cases the early 
terms of a series may vanish, and the first term of importance may be 
a term neglected.” 

A knowledge of the conditions under which the exact formula is 
necessary, of the limits of error of an approximation formula, and of the 
actual number of subjects necessary before a sample can be called 
“really large” should be of importance. 

Pearson and Moul² have criticized the attempt of Spearman and
Holzinger to obtain the sampling error of the correlation tetrad beyond
a first approximation. They have noted that the error in the tetrad $t$
introduced by substituting the sample correlations $r$ for the population
correlations $\rho$ is of the order of $\sigma_r$, that is, of the order $1/\sqrt{N}$. They
say (p. 251),

“It is idle therefore to retain terms of the order $1/N^2$ when we neg-
lect terms of the order $(1/N)(1/\sqrt{N})$, which occur in connection with
replacing $\rho$ by $r$ in the terms of order $1/N$. The terms in $1/N^2$ would
only be of value provided we were dealing with small sampling with
a-priori known values of the $\rho$’s.”





¹ Wishart, J.: “Sampling errors in the theory of two factors.” British Journal
of Psychology, Vol. XIX, 1928-1929, pp. 180-187.

² Pearson, K., and Moul, M.: “The mathematics of intelligence. I. The
sampling errors in the theory of a generalized factor.” Biometrika, Vol. XIX,
1927, pp. 246-291.


214 The Journal of Educational Psychology 


The writer can see no reason why this criticism does not apply 
with equal force to the case of the product-moment tetrad or any other 
function of the variances and covariances. Since it has been shown 
by Pearson, Jeffery and Elderton¹ that we cannot apply Wishart’s
formula for the sampling variance of the product-moment tetrad to 
the correlation tetrad, we shall first derive a formula giving the 
sampling variance of the product-moment tetrad to terms of the order 
1/N, and then try to see how closely this approximates the results 
given by Wishart’s formula. 

The product-moment tetrad may be written 


$$ t = s_{13}s_{24} - s_{14}s_{23}. $$


Taking differentials, squaring, and substituting from (1), we obtain as
our approximation formula,


$$ \sigma_t^2 = \frac{1}{N}\left[3(\sigma_{13}\sigma_{24} - \sigma_{14}\sigma_{23})^2 + (\sigma_{11}\sigma_{22} - \sigma_{12}^2)(\sigma_{33}\sigma_{44} - \sigma_{34}^2) - \Delta\right], \tag{2} $$


where $\Delta$ is the determinant $|\sigma_{ij}|$, $i, j = 1, 2, 3, 4$.
Wishart proposes that instead of $t$, we substitute

$$ t' = \frac{N^2}{(N-1)(N-2)}\,t. $$

The reason for this suggestion is that the mean value of $t'$ in successive
samples is equal to the true corresponding tetrad in the population,
whereas the mean value of $t$ in samples differs from the true correspond-
ing tetrad in the population, in such a manner that

$$ \bar{t} = \frac{(N-1)(N-2)}{N^2}\,\tau, $$

$\bar{t}$ being the mean value in successive samples of $t$, and $\tau$ being the
corresponding true population tetrad, of which $t$ is an estimate. For
large samples, $t'$ will not differ from $t$ by an amount which is important
in comparison with the sampling variance, nor will $\bar{t}$ vary appreciably
from $\tau$.
Wishart also gives the sampling variance of $t$ for the case of the
multivariate normal distribution. This may be written,


$$ \sigma_t^2 = \frac{(N-1)^2(N-2)}{N^4}\left[S_t^2 + \frac{2}{N-1}(\sigma_{11}\sigma_{22} - \sigma_{12}^2)(\sigma_{33}\sigma_{44} - \sigma_{34}^2)\right], \tag{3} $$


where $S_t^2$ is the right-hand member of (2), omitting the factor $1/N$. In
this formula, the factors $(N - 1)$ instead of $N$ from Equations (1) are
kept, as are also the terms of all orders according to Wishart.

¹ Pearson, K., Jeffery, G. B., and Elderton, E.: “On the distribution of the first
product moment-coefficient in samples drawn from an indefinitely large normal
population.” Biometrika, Vol. XXI, 1929, pp. 164-193.

Comparing (2) and (3), we see that in the case of the standard 
error of the product-moment tetrad we make two errors when we use 
the approximation formula. 
The first error consists in substituting $1/\sqrt{N}$ for

$$ \frac{(N-1)\sqrt{N-2}}{N^2}. $$

If $N = 25$, this is equivalent to multiplying $\sigma_t$ by 1.086. But it is
common practice in reporting standard errors to neglect the second
significant figure if the first is greater than two. If the first significant
figure were four and the second six, the error committed by neglecting
the second and reporting the first as five would be equivalent to multi-
plying the correct value of the standard error by 1.087. It would
seem therefore that as far as this first error is concerned, a sample as
large as twenty-five is “really large.” And it is to be noted that this
error always increases the value of $\sigma_t$ as given by (2).
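
The size of this first error for any N follows directly from the two factors just compared. The sketch below simply divides one by the other; the function name is illustrative, and nothing beyond the two expressions in the text is assumed.

```python
import math

def inflation_factor(n):
    """Ratio of the approximate factor 1/sqrt(N) to the exact factor (N-1)sqrt(N-2)/N^2."""
    approximate = 1.0 / math.sqrt(n)
    exact = (n - 1) * math.sqrt(n - 2) / n ** 2
    return approximate / exact

for n in (25, 50, 100, 400):
    print(n, round(inflation_factor(n), 3))

# Rounding a standard error of .046 up to .05 multiplies it by about the same amount.
print(round(5 / 4.6, 3))
```

The ratio exceeds one for every N and shrinks toward one as N grows, which is the sense in which the approximation always overstates the standard error slightly.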
The second error consists in neglecting the expression,

$$ \frac{2}{N-1}(\sigma_{11}\sigma_{22} - \sigma_{12}^2)(\sigma_{33}\sigma_{44} - \sigma_{34}^2). $$

The approximation formula is not the first term of a series proceeding
in inverse powers of $N$ so far as this error is concerned, and in fact the
only way to obtain such a series is by the expansion of $2/(N - 1)$,
which is idle in view of the fact that all terms in the above neglected
expression are of order $1/N^2$ in (3). There is no absolute limit to the
importance of this error. In fact, if all six of the covariances are equal
to zero, formula (2) will give $\sigma_t = 0$, while (3) will give,

$$ \sigma_t = \frac{\left[2(N-1)(N-2)\,\sigma_{11}\sigma_{22}\sigma_{33}\sigma_{44}\right]^{1/2}}{N^2}. $$

It is the population covariances which have been assumed here to be
equal to zero. The sample covariances are still open to fluctuation
from sample to sample, and so also is $t$. It is therefore obvious that $\sigma_t$
cannot be equal to zero. This is the case pointed out by Wishart in
which the first (and in fact the only) term of importance in $\sigma_t$ is a term
neglected in the approximation formula. But we must remember that
the population variances and covariances are unknown, and that in
practice we must substitute the sample values in either (2) or (3).
This error, as has been shown by Pearson and Moul, is of lower order
(i.e. of greater magnitude generally speaking) than is the error involved
in neglecting the terms multiplied by $2/(N - 1)$. Consider a hypo-
thetical example. Let

$$ \sigma_{11} = \sigma_{22} = \sigma_{33} = \sigma_{44} = 1, $$
$$ \sigma_{12} = \sigma_{13} = \sigma_{14} = \sigma_{23} = \sigma_{24} = \sigma_{34} = 0, $$

and assume that we have a sample of 100. From (1) we find that
$PE_{s_{11}} = .0954$, and $\sigma_{s_{12}} = .1$. The probable errors of the other three
variances are likewise equal approximately to .1, and the standard
errors of the other five covariances are all .1. It is not unreasonable to
assume that in a random sample two of the variances will be one PE
above the population value while the other two are one PE below it.
It is also fairly reasonable to assume that two of the covariances will
coincide with the population value, two will be one standard error
above it, and two will be one standard error below it. If we take
$s_{12}$ and $s_{34}$ equal to zero in order to maximize the terms multiplied by
$2/(N - 1)$ in (3), and hence to emphasize the difference between the
values given by (2) and (3), we may write,

$$
\begin{aligned}
s_{11} &= s_{22} = 1.1 \\
s_{33} &= s_{44} = .9 \\
s_{12} &= s_{34} = 0 \\
s_{13} &= s_{14} = .1 \\
s_{23} &= s_{24} = -.1
\end{aligned}
$$

Of the values $s_{13}$, $s_{14}$, $s_{23}$, and $s_{24}$, it is immaterial which two are assigned
the value +.1 and which two the value −.1; the value of $t$ and of $\sigma_t$ by
both formulas will remain the same. In this case, $t = 0$, and
$\sigma_t = .0199$ by the shorter formula and .0239 by the longer one, on
replacing the $\sigma$’s in (2) and (3) by the $s$’s from the sample of 100 as
given above. If $s_{11} = s_{33} = 1.1$ and $s_{22} = s_{44} = .9$, the values of
$\sigma_t$ by the shorter and longer formulas would only change to .0200 and
.0240 respectively.

Now going back to the o’s, which in our hypothetical example 
are known a-priori, we find that the correct value of $\sigma_t$ by (3) is .0139.
Both of the values computed from the sample are too large, and in this
case it happens that the value given by Wishart’s formula is further
from the truth than is the value given by the approximation formula.
The difference between the two values of $\sigma_t$ computed from the sample
is .0040. The average of the two differences between the values
computed from the sample and the correct value is .0080. This is just
twice as great as the difference between the values given by the two
formulas, and over half as great as the correct value of $\sigma_t$ itself.
Consider another hypothetical example. Let 


$$ \sigma_{11} = \sigma_{22} = \sigma_{33} = \sigma_{44} = 1 \text{ as before,} $$
$$ \sigma_{12} = \sigma_{13} = \sigma_{14} = \sigma_{23} = \sigma_{24} = \sigma_{34} = .5, $$

and assume again that we have a sample of 100. Then $PE_{s_{11}} = .1$,
approximately, as before; and $\sigma_{s_{12}} = .1118 = .1$, approximately, still.
We shall again assume that two of the variances in the sample are one
PE above, and two are one PE below the true value; and that two of
the covariances of the sample coincide with the true value, two are one
standard error above it, and two are one standard error below it.
Then assigning values so as to maximize the term multiplied by
$2/(N - 1)$, we may write,

$$
\begin{aligned}
s_{11} &= s_{22} = 1.1 \\
s_{33} &= s_{44} = .9 \\
s_{12} &= s_{34} = .4 \\
s_{13} &= s_{14} = .6 \\
s_{23} &= s_{24} = .5
\end{aligned}
$$

From these values we find that $t = 0$, and $\sigma_t = .0657$ by (2) and
.0654 by (3), on replacing the $\sigma$’s in these formulas by the corresponding
$s$’s as given above. The correct value of $\sigma_t$, obtained from the a-priori
known $\sigma$’s by (3), is .0501. The difference between the two values
obtained from (2) and (3), using the sample $s$’s, is .0003; and the
average of the two differences between these values and the correct
value is .0154. This last value is very much greater than the difference
between the values given by (2) and (3); and it is well over one fourth
as great as the true value of $\sigma_t$. Both of the sample estimates in this
example are again too high, but the estimate obtained from Wishart’s 
formula is slightly better. 
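
The two hypothetical examples can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the author's own computation: it reads formula (2) as σ_t² = S_t²/N with S_t² = 3(σ13σ24 − σ14σ23)² + (σ11σ22 − σ12²)(σ33σ44 − σ34²) − Δ, and formula (3) as σ_t² = [(N−1)²(N−2)/N⁴][S_t² + 2(σ11σ22 − σ12²)(σ33σ44 − σ34²)/(N−1)]. The function names and the matrix helper are hypothetical.

```python
import math

def det(m):
    """Determinant by cofactor expansion along the first row (adequate for 4 x 4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def parts(c):
    """Return (S_t^2, neglected product) for a 4 x 4 variance-covariance matrix c."""
    tetrad = c[0][2] * c[1][3] - c[0][3] * c[1][2]
    product = (c[0][0] * c[1][1] - c[0][1] ** 2) * (c[2][2] * c[3][3] - c[2][3] ** 2)
    return 3 * tetrad ** 2 + product - det(c), product

def se_approx(c, n):
    """Standard error of the tetrad by the approximation formula (2)."""
    s2, _ = parts(c)
    return math.sqrt(s2 / n)

def se_exact(c, n):
    """Standard error of the tetrad by Wishart's formula (3), as read above."""
    s2, product = parts(c)
    factor = (n - 1) ** 2 * (n - 2) / n ** 4
    return math.sqrt(factor * (s2 + 2 * product / (n - 1)))

def matrix(v, c12, c13, c14, c23, c24, c34):
    """Build the symmetric 4 x 4 matrix from four variances and six covariances."""
    return [[v[0], c12, c13, c14],
            [c12, v[1], c23, c24],
            [c13, c23, v[2], c34],
            [c14, c24, c34, v[3]]]

N = 100
sample_var = (1.1, 1.1, 0.9, 0.9)
pop_var = (1.0, 1.0, 1.0, 1.0)

# First example: population covariances all zero.
sample1 = matrix(sample_var, 0.0, 0.1, 0.1, -0.1, -0.1, 0.0)
pop1 = matrix(pop_var, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
# Second example: population covariances all .5.
sample2 = matrix(sample_var, 0.4, 0.6, 0.6, 0.5, 0.5, 0.4)
pop2 = matrix(pop_var, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5)

for label, sample, pop in (("first", sample1, pop1), ("second", sample2, pop2)):
    print(label,
          round(se_approx(sample, N), 4),  # formula (2), sample values
          round(se_exact(sample, N), 4),   # formula (3), sample values
          round(se_exact(pop, N), 4))      # formula (3), population values
```

Under these readings the three figures for each example agree with those given in the text to four decimals, which also serves as a check on the reconstruction of (2) and (3).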

It would seem on the whole, therefore, that in all practical cases in 
which the sample values of variances and covariances must be sub- 
stituted for the corresponding population values in the formula for the 
standard error of the product-moment tetrad, no advantage is to be 
gained by using Wishart’s exact formula rather than the approximation 
given by (2). It is also obvious that the practice of reporting standard 
errors to only one significant figure (or to two if the first is one or two) is 
fully justified. 





THE RELATIVE DIFFICULTY OF THREE 
ACHIEVEMENT EXAMINATIONS 
T. G. FORAN 
The Catholic University of America 


AND 
SISTER M. EDMUND LOYES 


Bernardine Sisters, Reading, Pa. 


There are now available four general achievement examinations 
for use in the elementary school. These are the New Stanford, the 
Modern School, the Metropolitan, and the Unit Attainment Scale. 
The four examinations are quite similar in general structure, covering 
substantially the same subjects of instruction, and providing highly 
reliable measures of achievement. The norms for the Modern School 
Achievement Test are based on the scores of six thousand seven 
hundred ten pupils while those of the New Stanford are based on 
a selected group of two thousand pupils. No information is given 
in the manual of directions for the Unit Attainment Scale regarding 
the source of the norms. The absence of such information in regard 
to the Unit Attainment Scale and the comparatively small numbers 
of scores used in establishing the norms for the New Stanford and 
Modern School tests suggested an investigation of the comparative 
difficulty of these three examinations. 

The three scales, the New Stanford V, the Modern School Achieve- 
ment Test I, and the Unit Attainment Scale, Form A, Division 2, 
were given in rotation order to three classes of sixth grade pupils. 
The rotation order of giving the tests was employed to equalize the 
practice effect. The three tests were given in three successive weeks. 
The schedule for the administration of the tests was: 








                  School A            School B            School C
First week ...... Stanford            Modern School       Unit Attainment
Second week ..... Modern School       Unit Attainment     Stanford
Third week ...... Unit Attainment     Stanford            Modern School














The tests were scored and re-scored by different persons and all 
computations were checked. The means and standard deviations 
of the scores were found for each of the exercises in the three tests and 
for the total score (educational age). To facilitate direct comparison, 
the means and the standard deviations are expressed in terms of age as 
all scores were converted into age scores by means of the table of norms
provided by the respective tests. The correlations were found between 
the various pairs of tests of the same subject and between the educa- 
tional ages. 


The means and standard deviations of the educational ages are:


Test                           N      Mean              SD (months)
                                      (years-months)
Unit Attainment Scale ........ 128    12-7              14
Modern School ................ 128    12-0              12
New Stanford ................. 128    11-9              11
[label illegible] ............ 128    12-6              12














Since the method of arranging the administration of the tests would 
equalize the practice effect, the differences between the means imply 
differences in the difficulty of the three scales as far as the present 
standardization is used. The Unit Attainment Scale yields con- 
siderably higher scores than do the other two batteries. There is a 
difference of ten months between the Unit Attainment and the New 
Stanford. The difference between the New Stanford and the Modern 
School is three months which is slightly less than three times the 
standard error of the difference, 1.13. The differences between the 
means are rather large, especially in the case of the Unit Attainment 
Scale and would be responsible for quite different interpretations of 
the achievement of this group of one hundred twenty-eight sixth- 
grade children. According to the Stanford Achievement Test, they 
are considerably below the norm whereas they reached the norm on 
the Unit Attainment. 

The correlations between the three series of educational ages are: 


Unit Attainment and Modern School ................. .713 ± .03
Unit Attainment and New Stanford .................. .763 ± .03
Modern School and New Stanford .................... .761 ± .03


Since the standard deviations of the distributions of educational ages 
are in the vicinity of twelve months, the standard errors of estimate 
are large. The standard errors of estimate are as follows: 


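                                                        MONTHS
Unit Attainment estimated from Modern School .......... 9.8
Modern School estimated from Unit Attainment .......... 8.4
Unit Attainment estimated from New Stanford ........... 9.2
New Stanford estimated from Unit Attainment ........... 7.2
Modern School estimated from New Stanford ............. 7.8
New Stanford estimated from Modern School ............. 7.1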
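
The size of these standard errors of estimate can be checked from the figures already given. The sketch below assumes that each value was obtained by the usual formula, SD × √(1 − r²), and it uses the rounded standard deviations (14, 12, and 11 months) and the correlations listed above; the function name and the pairing of each value with a test are assumptions made for illustration.

```python
import math

def se_of_estimate(sd_estimated, r):
    """Standard error of estimate of one test from another: SD * sqrt(1 - r^2)."""
    return sd_estimated * math.sqrt(1.0 - r * r)

sd = {"Unit Attainment": 14, "Modern School": 12, "New Stanford": 11}
r = {
    ("Unit Attainment", "Modern School"): 0.713,
    ("Unit Attainment", "New Stanford"): 0.763,
    ("Modern School", "New Stanford"): 0.761,
}

for (a, b), r_ab in r.items():
    # Each correlation yields two standard errors, one for each direction of estimate.
    print(f"{a} from {b}: {se_of_estimate(sd[a], r_ab):.1f} months")
    print(f"{b} from {a}: {se_of_estimate(sd[b], r_ab):.1f} months")
```

With the rounded standard deviations this gives roughly 9.8, 8.4, 9.0, 7.1, 7.8, and 7.1 months. Four of these match the listed values to the first decimal; the remaining two differ by a tenth or two, possibly because unrounded standard deviations were used in the original computation.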

Table I contains the means, standard deviations, and the coef-
ficients of correlation between the various tests of the same subjects.
Although there is some agreement between two of the three tests of a 
subject, the three scales never agree exactly and some of the differences 
are quite large. Of the twenty-four comparisons of means that are 
possible, only six of the differences are less than five months and nine 
of them are twelve months or more. The three tests show the most 
agreement in Computation and in Language Usage and least agree- 
ment in Reading where the three differences are fifteen months, seven- 
teen months, and two months. Although the Stanford generally 
yields lower scores than the other two tests and the Unit Attainment 
the highest scores, the trend is reversed in Reading with the highest 
average being the Stanford and the Modern School providing the 
lowest. It can thus be seen that the differences in difficulty among 
these three tests are not consistent but vary with the subject measured. 
The amount of the differences between the three pairs of tests may be 
summarized as follows: 











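Difference      Unit A and      Unit A and      Modern and
 (months)         Modern         Stanford        Stanford
17 .......................................... 1
16
15     1
14 .......................................... 1
13 ............................. 1     1
12     1     1
11
10
9
8 ........................................... 1
7 ........................................... 1
6 ........................................... 1
5 .............................. 3     2
4
3
2      1     1     1
1 ........................................... 1
0      2
Total ..........................  8         8         8
Median .........................  9 mos.    6 mos.    5.5 mos.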

The correlations between the tests are also given in Table I. The
correlation coefficients range from 0.239 (Language Usage: Unit Attain- 
ment and Modern School) to 0.898 (Geography: Stanford and Modern 
School). The latter is surprisingly high but the two tests are very 
similar in form as well as in content. 

The results of this investigation indicate the necessity of using 
great caution in evaluating achievement with reference either to the 
norms or to the scores obtained in other subjects. The trend shown 
by the results from one test may be altered greatly by the results from 
another test. Under these circumstances it appears that extensive 
remedial instruction would be justified only when the results from one 
test are confirmed by those of another. The “best” and “poorest” 
subjects indicated by each of the three examinations are: 








               Unit                Modern                  New
               Attainment          School                  Stanford
Best ......... Spelling            Computation, Language   Reading, History
Poorest ...... History             Reading                 Computation, Language














The above comparisons exclude Health, Literature, and Elementary 
Science since only two of the three examinations contained tests in 
these subjects. It is to be observed that the subjects with the highest 
scores on the Modern School test have the lowest scores on the
New Stanford. History, which is the “poorest” subject according
to the Unit Attainment Scale, is one of the best according to the New
Stanford. Reading reveals the same inconsistency between the 
Modern School and the New Stanford. It is quite clear that identifi-
cation of skills and deficiencies by means of any one of these tests
risks contradiction by some other test. 

The necessity of using local norms has been emphasized for some 
time and these data appear to confirm the opinion that accomplishment 
must be judged in relation to the particular assignment rather than in 
relation to the norms provided by such general achievement examina- 
tions as those that have been considered in this investigation. 


WHAT HAPPENS WHEN THE SECOND JUDGMENT IS
RECORDED IN A TRUE-FALSE TEST?¹


EDNA E. LAMSON 


State Normal School, Jersey City, N. J. 


The undergraduate has been advised frequently by his sophisticated 
friends as to the technique to be used in recording his judgment of the 
several items in a true-false examination. One line of advice is: 
“Record your first impression upon reading the item. If you decide 
the item is a true statement, use the proper scoring code and do not 
change it. If you change, your final judgment is likely to be incor-
rect.” Students who have followed this advice in their first experience
with true-false examinations have been surprised at their low total 
scores. Many first impressions they would have changed upon second 
reading and second thought but for the advice of friends who had 
taken true-false examinations. These and similar experiences arouse 
in students’ minds questions like the following: “Having recorded our
judgment as to the truth or falsity of the statement on first reading, 
may we change this judgment upon second reading if we so desire? 
Or, do you (the instructor) prefer to have us record our first judgments 
only?” 

In order to determine what directions to give students taking her 
own true-false examinations, the writer gathered the data of this 
study. 

The subjects whose judgments in true-false examinations furnished 
these data were students in thirty-two- and forty-eight-hour courses 
conducted by the writer during the academic year 1929-1930, the 
summer session of 1930, and the first semester of the academic year 
1930-1931. The students of 1929-1930 and of the summer session 
of 1930 were registered in Ball State Teachers College, Muncie, 
Indiana. Those of the first semester of 1930-1931 were registered in 
the School of Household Administration, University of Cincinnati, 
Cincinnati, Ohio. Although the majority of these students in both 
institutions were enrolled in junior and senior classes, nearly one-half 
of them were enrolled in freshman or sophomore classes.





1 Paper read before the Midwestern Psychological Association, Chicago, May,
1931.

THE DATA


The data in this study were taken from fifteen hundred eleven 
individual papers, including three hundred eighty-four from the 
freshman group, two hundred seventy from the sophomore group, and 
eight hundred fifty-six from the junior-senior group. The total 
number of items from these papers was one hundred forty-four 
thousand three hundred seventy, of which a little more than fifty per 
cent were true items. The content of these courses covered several 
phases of psychological knowledge, including advanced educational 
psychology, psychology of adolescence, psychology of childhood, edu- 
cational tests and measurements for intermediate grades, educational 
tests and measurements for primary grades, mental and physical 
development of children from birth through adolescence, behavior 
problems for young children, and mental hygiene. One course was an 
elementary course in the introduction to education. 

At the beginning of each examination, the instructor made this 
statement: “I am especially interested to ascertain what results when 
a student changes his judgment on an item in a true-false test. When 
a student changes a plus (our code for a true statement) to a zero 
(our code for a false statement) is the second judgment right or wrong? 
You may change your mind on any item, but draw a line through the 
first judgment that I may know what it was. Following the discarded 
judgment, record the one you want me to consider when I score your 
paper. No one will be penalized in any way for changing his first 
judgment to a second.” 

The direction and the results of the changes are presented in 
Table I. 


TABLE I.—NUMBER AND DIRECTION OF CHANGES MADE IN ONE HUNDRED
FORTY-FOUR THOUSAND, THREE HUNDRED SEVENTY JUDGMENTS

Total number of changes ............................ 3147
    From true to false ............................. 1792
    From false to true ............................. 1355

Number of correct changes .......................... 2066
    From true to false ............................. 1207
    From false to true .............................  859

Number of incorrect changes ........................ 1081
    From true to false .............................  585
    From false to true .............................  496

Ratio of correct changes to changes incorrect ...... 2 to 1

The data on changes from first judgments to second judgments
consist only of changes from plus to zero or vice-versa which were
obvious to the writer because the students had followed directions in 
recording their changed judgments. Of course there is the possibility 
of error in the assumption that the recording of all second judgments 
has been done according to directions. It is possible that in some cases 
the only judgment recorded was the second. The data consist of only 
such changes as were made after a judgment had been recorded on the 
score sheet. 

The total number of changed decisions recorded on fifteen hundred 
eleven papers was thirty-one hundred forty-seven. 


ANALYSIS OF THE DATA 


The analysis of the data yielded the percentages presented in 
Table II. 


TABLE II.—CHARACTERISTICS OF CHANGES OF JUDGMENTS IN TERMS OF PER CENT

Per cent of judgments changed ........................................ 2.2
Per cent of changes from plus to zero ................................ 57
Per cent of changes from zero to plus ................................ 43
Per cent of correct changes from plus to zero ........................ 68
Per cent of incorrect changes from plus to zero ...................... 32
Per cent of correct changes from zero to plus ........................ 63
Per cent of incorrect changes from zero to plus ...................... 37
Per cent that correct changes were of total changes .................. 66
Per cent that incorrect changes were of total changes ................ 34
Per cent of all correct changes that were changes from plus to zero .. 60
Per cent of all correct changes that were from zero to plus .......... 40
Per cent of all incorrect changes that were from plus to zero ........ 54
Per cent of all incorrect changes that were from zero to plus ........ 46


In seventeen hundred ninety-two second decisions, or fifty-seven 
per cent of the cases, the change was made from a plus to a zero. In 
thirteen hundred fifty-five second decisions, or forty-three per cent 
of the cases, the change was made from zero to plus. In sixty-eight per 
cent of the cases, the change from plus to zero resulted in the correct 
judgment. But in only sixty-three per cent of the cases did the 
change from zero to plus result in the correct judgment. Sixty per cent 
of all the correct changes were from plus to zero. The chances for the 
right result are greater when the plus is changed to zero than when the 
change is from zero to plus. The chance of recording the smallest 
percentage of wrong judgments is in changing a plus to a zero.

Of the changes made, twenty hundred sixty-six, or sixty-six per cent,
resulted in correct judgments; ten hundred eighty-one, or thirty- 
four per cent, resulted in incorrect judgments. When changes are 
made, the chances are two to one that the second judgment is the 
correct judgment, when both the changes from plus to zero and 
the changes from zero to plus are taken into consideration. 

The percentage of the number of changed judgments was 2.2. The 
first judgment on two out of every hundred items was changed. 
The average number of changes per individual paper was 2.1. Whether 
the data are considered with respect to percentage of changes or with 
respect to the average number of changes per paper, the amount of 
changing is strikingly small. The ratio of the percentage of right 
judgments obtained in changing plus to zero to the percentage of right 
judgments obtained by changing zero to plus is 1.08. The proportion 
of right judgments resulting when the change is made from plus to 
zero is eight per cent greater than the proportion of right judgments 
resulting when the change is made from zero to plus. Out of every 
hundred chances there are fifteen less of recording incorrect judgments 
when the change is from plus to zero than when it is from zero to 
plus. 
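
The percentages discussed above can be recomputed from the counts in Table I. The sketch below is only a check on the arithmetic; the variable and function names are hypothetical, and a recomputed figure may differ from a printed one by a point or so, since the rounding used in the original is not stated.

```python
# Counts taken from Table I and from the description of the data.
TOTAL_JUDGMENTS = 144_370
TOTAL_PAPERS = 1_511
CHANGES = {
    ("plus_to_zero", "correct"): 1207,
    ("plus_to_zero", "incorrect"): 585,
    ("zero_to_plus", "correct"): 859,
    ("zero_to_plus", "incorrect"): 496,
}


def total(direction=None, outcome=None):
    """Sum the change counts, optionally restricted by direction and/or outcome."""
    return sum(
        n
        for (d, o), n in CHANGES.items()
        if direction in (None, d) and outcome in (None, o)
    )


def pct(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


all_changes = total()
summary = {
    "per cent of judgments changed": pct(all_changes, TOTAL_JUDGMENTS),
    "changes per paper": all_changes / TOTAL_PAPERS,
    "per cent of changes from plus to zero": pct(total("plus_to_zero"), all_changes),
    "per cent of changes that were correct": pct(total(outcome="correct"), all_changes),
    "per cent correct, plus to zero": pct(
        total("plus_to_zero", "correct"), total("plus_to_zero")
    ),
    "per cent correct, zero to plus": pct(
        total("zero_to_plus", "correct"), total("zero_to_plus")
    ),
}

for label, value in summary.items():
    print(f"{label}: {value:.1f}")
```
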

A comparison was made between number of true items missed 
and number of false items missed. This comparison is based upon 
the analysis of the data collected during the fall semester of 1930-1931. 
These data consisted of 52,817 items from five hundred forty-seven 
papers passed in by one hundred twenty-six individuals. Of these 
items, 29,053, or fifty-five per cent, were true statements; 23,764, or 
forty-five per cent, were false statements. Of the true items, 4367, 
or fifteen per cent, were missed through being incorrectly judged. Of 
the false items, 6350, or twenty-seven per cent, were similarly missed. 
The percentage of false statements incorrectly judged was
eight-tenths larger than the percentage of true statements incorrectly
judged. For one true statement missed, practically two false
statements were missed.
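
The comparison of true and false items missed can be checked in the same way. The sketch below uses only the counts quoted in the paragraph above; the names are hypothetical.

```python
# Fall semester 1930-1931 counts quoted in the text.
TRUE_ITEMS, TRUE_MISSED = 29_053, 4_367
FALSE_ITEMS, FALSE_MISSED = 23_764, 6_350


def miss_rate(missed, items):
    """Per cent of items of one kind that were judged incorrectly."""
    return 100.0 * missed / items


true_rate = miss_rate(TRUE_MISSED, TRUE_ITEMS)
false_rate = miss_rate(FALSE_MISSED, FALSE_ITEMS)
ratio = false_rate / true_rate  # how many times larger the false-item miss rate is

print(f"true items missed:  {true_rate:.1f} per cent")
print(f"false items missed: {false_rate:.1f} per cent")
print(f"ratio of the two rates: {ratio:.2f}")
```

A ratio a little under two is what the text describes as "eight-tenths larger" and as "practically two" false statements missed for each true statement missed.
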


CONCLUSIONS 


It is better to record a second judgment in a true-false examination 
than not to record it. The chances are two to one that the second 
judgment will be the correct judgment. It is much safer to change a 
judgment from true to false than vice-versa. 

Similar conclusions have been reached in two other studies having
dissimilar approaches. Lowe and Crawford¹ reached similar con-
clusions on a smaller number of items. Their data were collected from
an experiment in which the students knew nothing about the investi- 
gation involved. The percentage of judgments they found had been 
changed was five times as large as found in the present study. 

Matthews² concluded that: “Students should be informed that it
pays in terms of score to check over all questionable items carefully 
in true-false and multiple choice types of tests rather than to trust 
to first impressions. They may expect to raise their scores at least 
twice as often as they lower them by changing their first responses 
when later judgment seems to justify it.” 

The writer has discovered empirically that a very small percentage 
of students need to study their idiosyncrasies regarding results from 
changing first judgments on true-false examinations. 





1 Lowe, M. L. and Crawford, C. C.: “First Impressions versus Second Thought
in True-False Tests.” Journal of Educational Psychology, Vol. XX, March, 1929,
pp. 192-195.

2 Matthews, C. O.: “Erroneous First Impressions on Objective Tests.” Jour-
nal of Educational Psychology, Vol. XX, April, 1929, pp. 280-286.


ERROR IN THE USE OF THE STANDARD ERROR


W. R. VAN VOORHIS 


The Pennsylvania State College 


The formula for the Standard Error of the difference between means
of samples has often been misused either because of the limitations of 
the data at hand or through failure on the part of the worker to recog- 
nize the true significance of the items involved in the formula. This 
common error lies in the usage of the standard deviations of the 
observed samples instead of those of the parent populations from which 
the samples were drawn whenever that information is available. Thus, 
in the formula 





$$\sigma_{m_1 - m_2} = \sqrt{\sigma_{m_1}^2 + \sigma_{m_2}^2 - 2\, r_{12}\, \sigma_{m_1} \sigma_{m_2}} \qquad (1)$$


the two $\sigma_m$'s actually involve the standard deviations of the total
distributions from which the compared samples were taken. Further-
more, if the two samples are themselves drawn from the same totality,
$\sigma_1$ and $\sigma_2$ are never distinct but refer to the one true standard deviation.

The proof of the foregoing lies in a simple algebraic employment of 
the discovery that in the formula 


$$\sigma_m = \frac{\sigma_p}{\sqrt{n}}$$


the $\sigma_p$ is the standard deviation of the total population and not that of
the sample from which the single mean was computed.

The $\sigma_{m_1}$ for the sample drawn from the parent population $p_1$
involves the standard deviation of the totality $p_1$, and the $\sigma_{m_2}$ for the
compared sample involves the standard deviation of the totality $p_2$.
It follows through substitution, therefore, that (1) involves the same
true standard deviations. Hence, for two samples of $n_1$ and $n_2$
observations


$$\sigma_{m_1 - m_2} = \sqrt{\frac{\sigma_{p_1}^2}{n_1} + \frac{\sigma_{p_2}^2}{n_2} - 2\, r_{12}\, \frac{\sigma_{p_1} \sigma_{p_2}}{\sqrt{n_1 n_2}}} \qquad (2)$$


Strictly speaking, formula (2) is applicable only when the true
standard deviations of the two parent populations are known. However,
as these two values can not in general be determined, the assumption
is usually made that the standard deviations of the observed
samples are representative of the true standard deviations. Obviously,
in many cases of limited sampling this assumption is a very violent one.
This makes it important that the investigator be reasonably certain
that the variabilities of the observed samples be characteristic of the
total areas from which these samples were drawn.



1 Peters, C. C. and Van Voorhis, W. R.: “A New Proof and Corrected Formulae
for the Standard Error of a Mean and of a Standard Deviation.” The Journal of
Educational Psychology, Vol. XXIV, November, 1933, No. 8, pp. 620-633.

If the compared samples are thought of as belonging to the same
parent population, the two standard errors of the mean would involve
the same $\sigma_p$. Furthermore, the samples would be independent so that
$r_{12}$ would equal zero. The only variation in this particular case would
be the numbers in the samples. Formula (2) would then become


$$\sigma_{m_1 - m_2} = \sigma_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \qquad (3)$$


In view of the fact that the size of the standard deviation varies
directly with the size of the sample, the $\sigma_p$ in (3) should be based on all
cases in both observed samples whenever the true standard deviation is
not known. This procedure will give the best possible estimate of the
true standard deviation and thus tend to give a value for $\sigma_{m_1 - m_2}$
which is more nearly correct than that computed on the basis of the
two separate standard deviations.

If the number of cases is the same for each sample; i.e., if
$n_1 = n_2 = n$, formula (3) becomes


$$\sigma_{m_1 - m_2} = 1.414\, \frac{\sigma_p}{\sqrt{n}} \qquad (4)$$


The application of (4) will in general be restricted. It should be
noted, however, that the only assumptions involved in this latter
formula are that the number of observations in each sample be the
same and that the samples be drawn from the same parent population.

Formulas (2), (3), and (4) give larger standard errors than the one
commonly employed, and, hence, lead to more conservative critical
ratios. The main demand upon the research worker by the use of the
proper formula will thus be one of more complete and typical data.

One is never justified in using the standard deviations of the
observed samples in comparing means whenever the standard devia-
tions for the total distributions are known. The usage of the standard
deviation of the sample in the formula for the standard error of the
mean, even when $(n - 1)$ instead of $n$ is used in the denominator,
connotes the assumption that the square of the standard deviation of 
the sample equals the mean of all the squared standard deviations 
obtainable from the totality of possible samples of n observations or 
measurements. When the true standard deviation is not available, 
the extent to which one can rely upon the typicalness of the sample 
standard deviation is as yet conjectural. These facts suggest the 
advisability of weighing critically the great odds in favor of true 
differences displayed so often in many experimental studies. 
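
Formulas (2), (3), and (4) can be put in computational form as follows. This is a minimal sketch rather than the author's own procedure: the function names are hypothetical, and the combined standard deviation is computed with divisor N over all cases in both samples, which is one reading of "based on all cases in both observed samples."

```python
import math
from statistics import pstdev


def se_diff_known_sds(sd_p1, sd_p2, n1, n2, r12=0.0):
    """Formula (2): standard error of the difference between two means,
    using the standard deviations of the two parent populations."""
    variance = (
        sd_p1 ** 2 / n1
        + sd_p2 ** 2 / n2
        - 2.0 * r12 * sd_p1 * sd_p2 / math.sqrt(n1 * n2)
    )
    return math.sqrt(variance)


def se_diff_same_population(sd_p, n1, n2):
    """Formula (3): independent samples drawn from one parent population."""
    return sd_p * math.sqrt(1.0 / n1 + 1.0 / n2)


def se_diff_equal_n(sd_p, n):
    """Formula (4): same parent population and equal sample sizes."""
    return math.sqrt(2.0) * sd_p / math.sqrt(n)


def combined_sd(sample_a, sample_b):
    """Assumption: estimate the parent SD from all cases in both samples
    taken together, with divisor N."""
    return pstdev(list(sample_a) + list(sample_b))


# Illustrative scores only; not data from any study.
group_a = [52, 48, 61, 55, 47, 58]
group_b = [50, 63, 57, 66, 59, 54]
sd_p = combined_sd(group_a, group_b)
print(f"combined SD: {sd_p:.2f}")
print(f"formula (3): {se_diff_same_population(sd_p, len(group_a), len(group_b)):.2f}")
print(f"formula (4): {se_diff_equal_n(sd_p, len(group_a)):.2f}")
```

With equal sample sizes, formulas (3) and (4) give the same value, and formula (2) reduces to both when the two parent standard deviations are equal and the correlation is zero.
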











THE LEARNING CURVE IN SOLVING A JIG-SAW 
PUZZLE: A TEACHING DEVICE 


LOUISE E. ALTENEDER 
State Normal School, Paterson, N. J. 


An experiment in the learning process, and in the construction of 
a learning curve, is of much help to the beginning student in educa- 
tional psychology. Various experiments have been suggested for this 
purpose, such as card sorting, substitution experiments, mirror-writing, 
typewriting, etc. The writer performed the following experiment in 
order to ascertain the value of a jig-saw puzzle for student experiments 
in the learning process. 

A jig-saw puzzle of one hundred pieces, of three-ply wood, the sub- 
ject of which was a picture of a landscape not familiar to the experi- 
menter, was used. 

Two series of thirty-five trials each were made, with an interval of 
two weeks between the two series. Each series was completed in four 
days of one week. The pieces were shaken together in a box and then 
dumped haphazardly on the table. The time was calculated from the 
moment the first piece was touched until the puzzle was completed. 
Even minutes were used: if there were more than thirty seconds after 
the last full minute, a minute was added, if less than thirty seconds, the 
seconds were dropped. A stop watch was used. 


DISCUSSION OF RESULTS 


A study of the Graph shows the following characteristics of the 
learning process: 

1. An initial spurt, shown in the graph by the decided drop in the 
first three trials. After the picture was completed the first time numer- 
ous associations were formed so that the time was considerably lessened 
in the second and third trials. 

2. A Period of Fluctuation.—Some of the associations were for- 
gotten, others retained. For some pieces of the puzzle there were no 
associations and various positions were tried, indicating trial and error 
learning. Associations were formed not only with relation to the 
picture, but with relation to the shapes of the various pieces. 

3. An increase in the number of correct associations, and decrease in 
the number of incorrect placement of pieces. 

4. Physiological Limit.—Toward the end of the first series of trials 
there was very little fluctuation, there being a steady decrease in time 
from twelve to nine minutes. While there was some loss in time after
the two weeks’ interim of no trials, the limit of nine minutes for the 
first series of trials was soon reached, and then lowered to eight, and 
then to seven minutes. For the last twenty trials of the second series 
the time remained fairly constant. However, a great deal of effort 
and attention was required to hold the seven minute limit. This 
might therefore be near the physiological limit, while nine minutes 
might be called a practical limit. 

5. The Plateau.—A period of no improvement, followed by further
improvement, was not present in the initial series, but did appear in the
second series. (See trials 16-22.) There seemed rather to be steady
improvement until the limit was apparently reached.

[Graph: Learning Curve in Solving Jig-Saw Puzzle]

As a result of this experiment the writer believes the jig-saw puzzle 
affords a simple and interesting experiment for students to perform in 
connection with a study of the learning process and the learning curve. 
There is much opportunity for forming and retaining of associations, of 
observing trial and error learning, of noting the value of sustained 
attention, and of other processes associated with learning. An 
approach to a physiological limit seems possible if at least thirty-five 
trials are made. A puzzle should be selected which has never been 
tried, and one which would not take more than an hour for the first 
trial. 














BOOK REVIEWS 


MAX SMITH. The Relationship between Item Validity and Test Valid-
ity. Teachers College Contributions to Education, Columbia
University, 1934, pp. VII + 40. 


This investigation deals with the validity of two hundred items 
selected from the list of vocabulary items for CAVD. The items 
were administered in 1930 to three hundred seventy individuals 
and the biserial correlation between the item and total score was
determined. The items were ranked according to the size of the
biserial correlation and the list was then broken into five sets: (1)
items 1-20, (2) items 61-80, (3) items 121-140, (4) items 181-200, and
(5) the remaining items, which were used as the criterion.

The correlations were then determined between the sets of items 
and the criterion and, as expected, set (4), the “worst” items, gave
the lowest validity coefficient, .43. The test was applied to two other
groups of students (1931 and 1932) and similar results were found.
The best set of items (1), however, did not give quite as high a validity
coefficient as the “high” set (2), .87 vs. .88. Smith concludes, “In
using item validity coefficients to construct alternate forms of a test
the test maker will find it more economical to concentrate on eliminat-
ing the ‘worst’ items rather than on utilizing only the ‘best’ items.”
He suggests that vocabulary items with a biserial r less than .40 be 
discarded. 

Data on the effect of successively eliminating the lowest and next 
lowest set of items on the validity coefficients show that the validity 
is slightly improved by eliminating the “worst” items but is lowered
significantly (statistically) when both of the two lowest sets are
eliminated. He concludes, “Consequently, item validity coefficients
(if already calculated) should have been used to eliminate the ‘worst’
items but not the ‘low’ items.” There are certain factors, however,
that have not been given due weight, namely (1) that the criterion
had been so chosen as to eliminate the twenty most valid items and
twenty others of high validity, (2) that he is comparing tests of varying
lengths, namely forty, sixty and eighty items. The practical problem 
is not how much the validity is lowered by eliminating items that 
have less validity but how much will the validity be raised by the 
addition of more good items. 

Smith’s conclusions are admirably worded to cover all cases; he
says, “In view of the relatively slight validity improvement obtained
by their use, it seems that in many instances it would hardly be worth
while to compute item validity coefficients. Under special conditions,
however, if there were many items of negative validity, if the test
were to be very widely used, or if the investigator had extraordinary
facilities for computing item coefficients—such computation might
be justified.” The general tone of the monograph, other than the
above statement, is expressed by “However, in general, a moderate
amount of insight combined with care in the original selection of
items will eliminate most items of negative validity.” On the con-
trary, Smith’s data, on which more than a moderate amount of insight
and care had been expended, show that with these highly selected
items fifty-four, or twenty-seven per cent, fail to meet his criterion of a
biserial r of .40.                                    JACK W. DUNLAP.
Fordham University. 


ETHEL KAWIN. Children of Preschool Age. Chicago: University of
Chicago Press, 1934, pp. XXV + 340. 


The author, who was Director of the Preschool Department of the 
Institute for Juvenile Research in Chicago, presents in this volume a 
report of the work of that department for the years 1930-1933 inclu- 
sive, together with three researches based on an analysis of the records 
of the department. Part I describes the services of the Institute 
for children of preschool age and presents summaries of seven case 
histories that are also available in pamphlet form. Part II consists of 
three researches of interest particularly to workers in the field of 
individual differences and mental tests. Investigators who would 
undoubtedly be interested in these studies might overlook them 
because of their inclusion in a volume having such a general title 
centered in the preschool field. 

The first study is entitled “Young Children of Low and High 
Socio-Economic Status: A Comparative Study of their Performance on 
the Merrill-Palmer Scale.” Although paternal occupation is not the
criterion for differentiating the two groups, it is implied to be one of 
the chief items of socio-economic data. Records are incomplete on 
this point for over twenty per cent of the cases in both groups, and it is 
admitted that about twenty per cent of the cases considered to be of 
low socio-economic status, because of attendance at a settlement-house 
nursery school, actually came from families in the professional and 
managerial categories. The results are in substantial agreement with 
other studies in the literature indicating a superiority of the upper
social classes. The superiority is found to be greater on the Stanford-
Binet than on the Merrill-Palmer Scale, and on the latter it is lessened
when verbal items are omitted. These facts, the author considers, 
point to a verbal factor entering into the differences found. It is to be 
noted that had the sexes been equally represented in both groups even 
greater differences would probably have been found in favor of the 
upper social groups. The positive relationship between socio-economic 
status and intelligence which earlier writers have demonstrated seems 
to be mistaken by the author for perfect positive correlation which 
would justify accurate individual prediction. With the social worker’s 
sympathy for, and optimism about, individuals of the lower levels of 
society, the writer over-emphasizes the few exceptional cases and the 
amount of overlapping of the two groups, and never fully accepts the 
obvious implications of the group data. Serious gaps in the data, 
the omission of basic frequency distributions, and the incomplete 
presentation of the significance of differences at critical points, and the 
erroneous computation of it at others, are regrettable. 

The second study, “Social Adjustment in Children of Preschool
Age,” is concerned with the child’s relationship to other children not his
siblings in relation to various factual items and subjective ratings on 
family relationships extracted from case records. This study is 
another illustration of the inadequacy of case record material for 
research purposes when the data have not been collected with a planned 
project in mind. Groups subjectively classified as “problem” and
“well-adjusted” are compared on even more subjective trait ratings
while many other factors are permitted to vary uncontrolled. 

In the third study, “An Analysis of Stanford-Binet and Merrill-
Palmer Test Results for Children of Preschool Age,” the author
presents her soundest data. She demonstrates again that the Stanford-
Binet test is too easy at the range studied, and that the IQ is less
constant for preschool than for older children. The Merrill-Palmer 
test appears to be better standardized. The correlation between the 
two tests was .78 for fifty-five cases tested two weeks apart. 

The most striking result reported, and one which throws the results 
of the first study open to still further question, is the low reliability 
found for the Merrill-Palmer Test. Coefficients range from .49 to .59 
for retests six months to a year later. They are no higher when raw 
scores with age constant are used instead of sigma scores. 

The book is most stimulating reading and brings to the fore in 
clear-cut and well-organized discussion a variety of basic problems. 
Unfortunately the service records available did not afford adequate
data with which to answer many of them, no matter how ingenious
the analysis.                                    DOROTHEA MCCARTHY.
Fordham University.


COLEMAN R. GRIFFITH. An Introduction to Educational Psychology.
New York: Farrar and Rinehart, Inc., 1935, pp. XIV + 754. 


The industry of Griffith is astounding. Within the space of two 
months two of his books—“An Introduction to Applied Psychology”
and the one under review—have arrived at the reviewer’s desk. Both 
works are long and both are documented with hundreds of references. 

Griffith’s main viewpoint is that it is best to develop an educa- 
tional psychology from a study of the development of children. 
Knowing their traits at different periods of life, it is then possible 
to say how they should be taught. The actual experimentation with 
children is still lamentably small, and the observations so far made 
on them show little agreement, but the method is undoubtedly sound. 
To overlook the fact that man is first and foremost an animal and 
that there is more reputable experimental material on animal learning 
than on human learning is to cast aside some valuable basic material. 
The ideal text would take its stand on the experimental findings of 
both animal and human learning. It would trace man, as it were, 
from the lowest brute to the human adult stage. 

Griffith’s plan of treatment is somewhat concentric. One finds 
a topic, say motivation, treated in half a dozen places. The plan 
has both its merits and drawbacks, but it may be somewhat confusing 
to the immature student of the subject. Each section, as well as 
each chapter, starts with an introduction, so there are scores of 
introductions dotted around. This also is somewhat confusing. The
conclusions of experimental findings are given, but the evidence on
which they are based is frequently omitted. Instead, the references
are given, but no student could consult all the references given in 
a dozen years. Would it not have been better to have selected the 
typical study or studies, given the evidence and the conclusions, and 
added the references for deeper study at the end of the chapters? 
Nevertheless, the text is a gallant attempt to write the ideal 
text on educational psychology. It is a remarkably good text, and
if it falls short of the ideal, the same criticism could be made
of every text in the field. For the more advanced student of the
subject the text can be confidently recommended. It is probably
somewhat too difficult for the beginner, although it bears the title
“An Introduction to Educational Psychology.” The only error discovered
in the seven hundred odd pages was one on p. 299 which
referred to McDougall’s list of instincts. He has expanded the
number to fourteen in his later writings. PETER SANDIFORD.
University of Toronto. 


EDMUND S. CONKLIN. Principles of Adolescent Psychology. New
York: Henry Holt & Co., 1935, pp. XII + 437. 


“There is a period of several years in the life of every human 
being when he is no longer a child nor is he yet a mature adult.”
More than a quarter of a century ago G. Stanley Hall published an 
elaborate two-volume treatise on the psychology of this period.
Since that time there has been a dearth of textbooks on adolescence,
although the quantity of research literature has been enormous.
Somewhat in the tradition of Hall, but with a critical insight
into, and a capable grasp of, the modern research and theoretical
contributions, Dr. Conklin has written a treatment of the psychological 
problems of adolescence which must inevitably be recognized as a 
landmark in this important aspect of genetic psychology. 

The important psychological problems of the adolescent period 
are those concerned with the development of personality. This being 
admitted, the essential thesis of the book might be stated thus: While
personality is being moulded throughout infancy and childhood, yet
the years of adolescence are uniquely critical because, on the one
hand, there are disturbing physiological and anatomical changes, and
on the other the individual is in the anomalous social position expressed
in the sentence quoted at the beginning of this review. This thesis 
is developed by consideration of influences operating on the individual 
and the results of those influences on behavior. The problems dis- 
cussed include physical maturation, adolescent interests and ideals, 
social adjustment, conflicts, and delinquencies, the family influences, 
romantic love, religious adjustments, and “abnormalities of personality
organization and adjustment.” The reader is constantly aware that
this author is not writing of these problems from the viewpoint of the
library and study, but must have spent considerable time dealing
directly, clinically one might say, with many adolescents and their
problems.

Because of his own interests this reviewer cannot help expressing
disappointment with the references and bibliographies. Although the
author states that his bibliography on the subject runs well over
two thousand titles, he gives a very small percentage of them. Surely
for the expert to present a critically selected list of references,
especially to original literature, is a distinct service to scientific progress.
Conklin has done this only in the meagerest sense. Also the biblio- 
graphical footnotes are inconsistent in typographic style, and often 
incomplete in citation. Journal references are uniformly complete, 
but book citations never give the publisher, and often lack place 
and/or date. 

The final word must be a hearty recommendation to teachers of 
adolescent psychology that they carefully consider this book as a 
text, and to high school administrators and teachers that they read 
the book as a practical help with their problems. C. M. LOUTTIT.
Indiana University.


RUTH C. PETERSON AND L. L. THURSTONE. Motion Pictures and the
Social Attitudes of Children. New York: The Macmillan Co., 
1933, pp. XVII + 75. 


FRANK K. SHUTTLEWORTH AND MARK A. MAY. The Social Conduct
and Attitudes of Movie Fans. New York: The Macmillan Co., 
1933, pp. 142. 


HERBERT BLUMER AND PHILIP M. HAUSER. Movies, Delinquency, and
Crime. New York: The Macmillan Co., 1933, pp. XIII + 233.


Motion pictures appeal to adults as well as children, but it is 
their appeal to children that worries adults. One of the chief reasons
for the organization of the Payne Fund studies of motion pictures
and youth, of which the studies reviewed here are parts, was to investigate
the influence movies have on the conduct, ideals and attitudes
of children. The work of the Committee on Educational Research
of the Payne Fund, with a membership of seventeen investigators
chaired by Dr. W. W. Charters, is intended to serve as an illustration
of an interesting technique for studying a social problem. Its distinctive
characteristic is, in the words of the chairman, “to analyze
a complex social problem into a series of subordinate problems, to
select competent investigators to work upon each of the subordinate
projects and to integrate the findings of all the investigators as a
solution of the initial problem.”

Of the three studies, Motion Pictures and the Social Attitudes of
Children, reported by Doctors Thurstone and Peterson, is the most
quantitative. The instruments used in this study were attitude scales
and comparison schedules. These scales were given to groups of
children before and after a selected picture had been shown, the affective
value being measured by the difference between the before and after
measures. Results are reported in terms of measures of central tendency
and measures of variability. The cumulative effect of the pictures 
used was measured by the use of the same scales in a similar man- 
ner. The issues studied included attitude towards nationality and 
race, crime, punishment of criminals, capital punishment and prohi- 
bition. Typical of the conclusions found are: The most striking 
change of attitude found in these experiments was the change of 
attitude with regard to negroes as a result of the showing of the picture 
“The Birth of a Nation”; “All Quiet on the Western Front” is more 
potent as an instrument for directing children’s attitudes away from 
war and towards pacifism than is “Journey’s End.”
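
The before-and-after comparison described in this paragraph can be illustrated with a minimal Python sketch. It assumes, for simplicity, that each child has one scale score before and one after the picture; the function name `attitude_shift` and the scores are hypothetical and do not reproduce the scales or data of the study.

```python
from statistics import mean, stdev

def attitude_shift(before, after):
    """Summarize the change in attitude-scale scores for one group.

    `before` and `after` hold one score per child, in the same order.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need paired scores for at least two children")
    # Per-child change: score after the picture minus score before it.
    diffs = [a - b for b, a in zip(before, after)]
    return {
        "mean_before": mean(before),  # central tendency before the picture
        "mean_after": mean(after),    # central tendency after the picture
        "mean_shift": mean(diffs),    # average change attributed to the picture
        "sd_shift": stdev(diffs),     # variability of the change across children
    }

# Hypothetical scale values for four children.
before = [4.0, 5.0, 6.0, 5.0]
after = [5.0, 5.5, 7.0, 6.5]
result = attitude_shift(before, after)
print(result)
```

The mean shift gives the direction and size of the change for the group, and its standard deviation shows how uniformly the children moved.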

The Social Conduct and Attitudes of Movie Fans, a study made 
by Doctors Shuttleworth and May of Yale, which in published form is 
combined with the previous study, attacks a problem similar in nature 
but more general in significance than the foregoing study. That is,
the object of this study is to measure not the attitudes developed
by specific movies but the general influence that the total
motion-picture experience of children has on
their attitudes. To get at this problem Doctors Shuttleworth and
May abandoned the experimental method and adopted the
survey method. To overcome the limitations of this method because
of the lack of controls, the investigators attempted to equate their 
groups and make detailed analyses of differences found within the total 
group. Some of their general findings are: Movie children average 
lower deportment records, do on the average poorer work in their school 
subjects, are rated lower in reputation by their teachers on two rating 
forms, are rated lower by their classmates on the “Guess Who” test,
are less cooperative and less self-controlled as measured both by 
ratings and conduct tests, are slightly more deceptive in school 
situations, are slightly less skillful in judging what is the most useful 
and helpful and sensible thing to do, are slightly less emotionally 
stable, are mentioned more frequently on the “Guess Who” test
as a whole and are named more frequently as best friends by their 
classmates. One of the most significant differences about which the 
authors make much ado is the fact that movie children desire to be
“a popular actor” much more frequently than non-movie children,
and, conversely, non-movie children desire to be “a college professor”
much more frequently than do movie children. 

The level of description used by Blumer and Hauser in their 
study, Movies, Delinquency, and Crime, is sociological—sociological 
in the general setting of the problem, the methods used and interpre- 
tation of findings made. Methods used include autobiographical 
accounts, personal interviews and questionnaires. Subjects investigated
included young criminals in a large state reformatory, ex-convicts,
most of them on parole, girls and young women delinquents
in a state training school and delinquent boys. Not all of the methods
were used on all of the subjects. One of the main general conclusions
of this study is that the potency of the movies in transmitting a 
heritage is in inverse proportion to the strength of the family, neigh- 
borhood and church. An interesting finding concerning the way of 
transmission is the fact that, contrary to the general will-to-believe 
of alarmists in the moral field, “moral” endings of pictures in which
wrong-doing is punished and virtue rewarded do not overshadow
the existing little episodes within the movie itself. In fact, according
to the investigators, “nothing is clearer than the frequency with
which details or elements of the picture may be picked out as domi- 
nantly significant to the exclusion or minimizing of the terminating 
episode.” 

A general warning that is not heeded by many people who are 
making use of the Payne Fund studies to justify their convictions 
and not sufficiently emphasized by enough of the investigators, is 
contained in the last paragraph preceding the summary and interpre- 
tation chapter of the Shuttleworth and May study. The paragraph 
is: “Factors of age, intelligence, school grade, and home background
are as important and possibly more important in influencing the
conduct and attitudes of children as the movie. In the case of atti- 
tudes, the influence of the community far overshadows in importance 
the influence of the movie.” In fact, the general impression of the 
reviewer is that the Payne Fund studies are in need of supplementation 
by more genetic studies of whole personalities in which movies are 
considered as one of the factors influencing the direction of personality 
growth. H. MELTZER.
Psychological Service Center, St. Louis, Missouri.














